Below, the columns available in the database are listed and explained:
- 'name': Name of the player.
- 'position': Position on the pitch (Goalkeeper, Defender, Midfielder, Forward).
- 'team': Premier League team with which the player is affiliated.
- 'xP': Expected points for the player in the given fixture.
- 'assists': Actual number of assists.
- 'bonus': Actual number of bonus points awarded.
- 'bps': Stands for 'Bonus Points System', a raw score based on performance metrics like goals, assists, clean sheets, saves, tackles, and other contributions that is used to rank players and determine 'bonus' scores.
- 'clean_sheets': Boolean column identifying whether the player earned points for a clean sheet (i.e., his team conceded zero goals while he was on the pitch).
- 'creativity': A measure of a player's potential to create scoring opportunities (passes, crosses, etc.).
- 'element': A unique ID for the player in the FPL system.
- 'fixture': The ID of the match the player participated in.
- 'goals_conceded': The number of goals the player's team conceded while they were on the pitch.
- 'goals_scored': The number of goals scored by the player.
- 'Influence_Creativity_Threat_Index': A combined metric summarizing the player's influence, creativity, and threat.
- 'influence': A measure of a player's impact on a match (defensive and offensive contributions).
- 'kickoff_time': The start time of the match.
- 'minutes': The number of minutes the player was on the pitch during the match.
- 'opponent_team': The ID of the opposing team in the fixture.
- 'own_goals': The number of own goals scored by the player.
- 'penalties_missed': The number of penalty kicks missed by the player.
- 'penalties_saved': The number of penalty kicks saved by the player (goalkeepers only).
- 'red_cards': The number of red cards received by the player.
- 'round': The fantasy round number of the match.
- 'saves': The number of saves made by the player (goalkeepers only).
- 'selected': The number of FPL managers who selected the player for their teams in this round.
- 'team_a_score': The number of goals scored by the away team in the match.
- 'team_h_score': The number of goals scored by the home team in the match.
- 'threat': A measure of a player's likelihood of scoring goals based on their attacking actions.
- 'total_points': The total FPL points earned by the player in the match. This column will be our label.
- 'transfers_balance': The net number of transfers for the player (transfers in minus transfers out).
- 'transfers_in': The number of FPL teams that transferred the player in before this match.
- 'transfers_out': The number of FPL teams that transferred the player out before this match.
- 'value': The player's price in FPL. The raw column is stored in units of £0.1 million (a 'value' of 50 means £5.0 million) and is converted to millions of GBP later in the notebook.
- 'was_home': A boolean indicating if the player's team was playing at home (True/1 = home, False/0 = away).
- 'yellow_cards': The number of yellow cards received by the player.
- 'GW': The specific gameweek for the match.
- 'expected_goals': A metric predicting the likelihood of the player scoring based on their chances.
- 'expected_assists': A metric predicting the likelihood of the player assisting a goal.
- 'expected_goal_involvements': The sum of 'expected_goals' and 'expected_assists', representing the player's total expected goal contributions.
Import Packages
Imports essential Python libraries and machine learning tools for data analysis, visualization, and model evaluation, as well as functions for splitting data into training and testing sets. These are typically used in machine learning projects to build and assess predictive models.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pylab as plt
import matplotlib.lines as mlines
import ast
import unicodedata
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')
Import Data
Let's start by importing the 'master.csv' file into a dataframe.
master = pd.read_csv('../master.csv')
master.head()
| | name | position | team | xP | assists | bonus | bps | clean_sheets | creativity | element | ... | transfers_balance | transfers_in | transfers_out | value | was_home | yellow_cards | GW | expected_goals | expected_assists | expected_goal_involvements |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Connolly | FWD | Brighton | 0.5 | 0 | 0 | -3 | 0 | 0.3 | 78 | ... | 0 | 0 | 0 | 55 | True | 0 | 1 | 0.392763 | 0.000000 | 0.392763 |
| 1 | Aaron Cresswell | DEF | West Ham | 2.1 | 0 | 0 | 11 | 0 | 11.2 | 435 | ... | 0 | 0 | 0 | 50 | True | 0 | 1 | 0.000000 | 0.000000 | 0.000000 |
| 2 | Aaron Mooy | MID | Brighton | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 60 | ... | 0 | 0 | 0 | 50 | True | 0 | 1 | NaN | NaN | NaN |
| 3 | Aaron Ramsdale | GK | Sheffield Utd | 2.5 | 0 | 0 | 12 | 0 | 0.0 | 483 | ... | 0 | 0 | 0 | 50 | True | 0 | 1 | 0.000000 | 0.000000 | 0.000000 |
| 4 | Abdoulaye DoucourÃ© | MID | Everton | 1.3 | 0 | 0 | 20 | 1 | 44.6 | 512 | ... | 0 | 0 | 0 | 55 | False | 0 | 1 | 0.000000 | 0.205708 | 0.205708 |
5 rows × 39 columns
master.shape
(111920, 39)
master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111920 entries, 0 to 111919
Data columns (total 39 columns):
 #   Column                             Non-Null Count   Dtype
---  ------                             --------------   -----
 0   name                               111920 non-null  object
 1   position                           111920 non-null  object
 2   team                               111920 non-null  object
 3   xP                                 111920 non-null  float64
 4   assists                            111920 non-null  int64
 5   bonus                              111920 non-null  int64
 6   bps                                111920 non-null  int64
 7   clean_sheets                       111920 non-null  int64
 8   creativity                         111920 non-null  float64
 9   element                            111920 non-null  int64
 10  fixture                            111920 non-null  int64
 11  goals_conceded                     111920 non-null  int64
 12  goals_scored                       111920 non-null  int64
 13  Influence_Creativity_Threat_Index  111920 non-null  float64
 14  influence                          111920 non-null  float64
 15  kickoff_time                       111920 non-null  object
 16  minutes                            111920 non-null  int64
 17  opponent_team                      111920 non-null  int64
 18  own_goals                          111920 non-null  int64
 19  penalties_missed                   111920 non-null  int64
 20  penalties_saved                    111920 non-null  int64
 21  red_cards                          111920 non-null  int64
 22  round                              111920 non-null  int64
 23  saves                              111920 non-null  int64
 24  selected                           111920 non-null  int64
 25  team_a_score                       111920 non-null  int64
 26  team_h_score                       111920 non-null  int64
 27  threat                             111920 non-null  float64
 28  total_points                       111920 non-null  int64
 29  transfers_balance                  111920 non-null  int64
 30  transfers_in                       111920 non-null  int64
 31  transfers_out                      111920 non-null  int64
 32  value                              111920 non-null  int64
 33  was_home                           111920 non-null  bool
 34  yellow_cards                       111920 non-null  int64
 35  GW                                 111920 non-null  int64
 36  expected_goals                     79272 non-null   float64
 37  expected_assists                   79272 non-null   float64
 38  expected_goal_involvements         79272 non-null   float64
dtypes: bool(1), float64(8), int64(26), object(4)
memory usage: 32.6+ MB
Data Exploration and Cleaning
Aligning Categorical Columns
Let's examine the columns that contain categorical data, including how many unique values each contains and what those unique values are.
# Define categorical columns
categorical_columns = ['position', 'team']
# Calculate how many unique values there are for each categorical column
unique_vals = master[categorical_columns].nunique()
print("Unique values in categorical features:")
print(unique_vals)
# Print different categorical values
for i in categorical_columns:
    print(f"\nUnique values in '{i}':")
    print(master[i].unique())
Unique values in categorical features:
position     5
team        27
dtype: int64

Unique values in 'position':
['FWD' 'DEF' 'MID' 'GK' 'GKP']

Unique values in 'team':
['Brighton' 'West Ham' 'Sheffield Utd' 'Everton' 'Fulham' 'Wolves' 'Leeds' 'Leicester' 'Liverpool' 'West Brom' 'Arsenal' 'Southampton' 'Newcastle' 'Chelsea' 'Crystal Palace' 'Spurs' 'Man Utd' 'Man City' 'Aston Villa' 'Burnley' 'Watford' 'Norwich' 'Brentford' 'Bournemouth' "Nott'm Forest" 'Luton' 'Ipswich']
As we can see, the 'position' column contains 5 unique values for only 4 positions: 'FWD' = Forward, 'DEF' = Defender, 'MID' = Midfielder, and both 'GK' and 'GKP' = Goalkeeper. 'GK' and 'GKP' refer to the same position but are labeled differently across seasons, so we will align them to a single value: 'GK'. We will also change the abbreviated name of Nottingham Forest in the 'team' column, 'Nott'm Forest', to the full name: 'Nottingham Forest'.
master['position'] = master['position'].replace('GKP', 'GK') # GKP --> GK
master['team'] = master['team'].replace("Nott'm Forest", "Nottingham Forest") # Nott'm Forest --> Nottingham Forest
# Calculate how many unique values there are for each categorical column
unique_vals = master[categorical_columns].nunique()
print("Unique values in categorical features:")
print(unique_vals)
# Print different categorical values
for i in categorical_columns:
    print(f"\nUnique values in '{i}':")
    print(master[i].unique())
Unique values in categorical features:
position     4
team        27
dtype: int64

Unique values in 'position':
['FWD' 'DEF' 'MID' 'GK']

Unique values in 'team':
['Brighton' 'West Ham' 'Sheffield Utd' 'Everton' 'Fulham' 'Wolves' 'Leeds' 'Leicester' 'Liverpool' 'West Brom' 'Arsenal' 'Southampton' 'Newcastle' 'Chelsea' 'Crystal Palace' 'Spurs' 'Man Utd' 'Man City' 'Aston Villa' 'Burnley' 'Watford' 'Norwich' 'Brentford' 'Bournemouth' 'Nottingham Forest' 'Luton' 'Ipswich']
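The alignment step above can also be written with explicit alias tables, which makes it easy to add further spellings later and to fail loudly if an unexpected label shows up. This is only a sketch on a toy frame; the names `toy`, `POSITION_ALIASES`, `TEAM_ALIASES`, and `expected_positions` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the 'master' dataframe
toy = pd.DataFrame({
    "position": ["GK", "GKP", "DEF", "MID", "FWD"],
    "team": ["Luton", "Nott'm Forest", "Spurs", "Leeds", "Nott'm Forest"],
})

# Alias tables (assumed names): old spelling -> canonical spelling
POSITION_ALIASES = {"GKP": "GK"}
TEAM_ALIASES = {"Nott'm Forest": "Nottingham Forest"}

toy["position"] = toy["position"].replace(POSITION_ALIASES)
toy["team"] = toy["team"].replace(TEAM_ALIASES)

# Guard against position labels that were not anticipated
expected_positions = {"GK", "DEF", "MID", "FWD"}
unexpected = set(toy["position"]) - expected_positions
assert not unexpected, f"Unexpected position labels: {unexpected}"
```

The final check is optional, but it catches a new label variant as soon as another season is appended.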
Encoding the Categorical Features
Now, to encode the categorical features. We will one-hot encode the 'position' column, since there are only 4 unique values. However, we will label-encode the 'team' column and create a new column: 'team_label'. We do this because we have 27 unique teams in the database, and one-hot encoding 'team' would increase dimensionality substantially. Furthermore, our ML model will be a Random Forest, which splits on thresholds rather than fitting a coefficient to the encoded value. The arbitrary integer order introduced by label encoding is therefore much less of a concern than it would be for a linear model, although the trees still see the labels as ordered numbers.
# One-hot encode the 'position' column while retaining the original column
position_dummies = pd.get_dummies(master['position'], prefix='position')
master = pd.concat([master, position_dummies], axis=1)
# Label encode the 'team' column while retaining the original column
le = LabelEncoder()
master['team_label'] = le.fit_transform(master['team'])
master.head()
| | name | position | team | xP | assists | bonus | bps | clean_sheets | creativity | element | ... | yellow_cards | GW | expected_goals | expected_assists | expected_goal_involvements | position_DEF | position_FWD | position_GK | position_MID | team_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Connolly | FWD | Brighton | 0.5 | 0 | 0 | -3 | 0 | 0.3 | 78 | ... | 0 | 1 | 0.392763 | 0.000000 | 0.392763 | False | True | False | False | 4 |
| 1 | Aaron Cresswell | DEF | West Ham | 2.1 | 0 | 0 | 11 | 0 | 11.2 | 435 | ... | 0 | 1 | 0.000000 | 0.000000 | 0.000000 | True | False | False | False | 25 |
| 2 | Aaron Mooy | MID | Brighton | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 60 | ... | 0 | 1 | NaN | NaN | NaN | False | False | False | True | 4 |
| 3 | Aaron Ramsdale | GK | Sheffield Utd | 2.5 | 0 | 0 | 12 | 0 | 0.0 | 483 | ... | 0 | 1 | 0.000000 | 0.000000 | 0.000000 | False | False | True | False | 20 |
| 4 | Abdoulaye DoucourÃ© | MID | Everton | 1.3 | 0 | 0 | 20 | 1 | 44.6 | 512 | ... | 0 | 1 | 0.000000 | 0.205708 | 0.205708 | False | False | False | True | 8 |
5 rows × 44 columns
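To make the two encodings concrete, here is a small sketch on a toy frame (the frame `toy` and the helper names `code_to_team` and `recovered` are assumptions for illustration). It shows that `LabelEncoder` assigns codes in the sorted order of the class names, which is why 'Brighton' receives label 4 in the table above, and that the original names can be recovered with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "position": ["GK", "DEF", "MID", "FWD", "DEF"],
    "team": ["Spurs", "Arsenal", "Leeds", "Arsenal", "Spurs"],
})

# One-hot encode 'position': one indicator column per category
dummies = pd.get_dummies(toy["position"], prefix="position")
toy = pd.concat([toy, dummies], axis=1)

# Label-encode 'team': a single integer column
le = LabelEncoder()
toy["team_label"] = le.fit_transform(toy["team"])

# Codes follow the sorted order of the class names
code_to_team = dict(enumerate(le.classes_))

# The mapping is reversible, so team names can be restored for reporting
recovered = le.inverse_transform(toy["team_label"])
```

Keeping the fitted encoder `le` around matters if new data must be encoded with the same integer codes later.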
Converting to DateTime, Extracting Time Elements, and Adding a Season Identifier Column
Now let us convert the 'kickoff_time' column into datetime format, extract the 'Hour', 'DayOfWeek', 'Weekend', 'WeekOfYear', 'Month', and 'Year' elements, and store them in new columns of the 'master' dataframe. We will also create a 'Season' column that assigns each row to the correct Premier League season, from 2020-2021 to 2024-2025, based on 'kickoff_time'.
master['kickoff_time'] = pd.to_datetime(master['kickoff_time']).dt.tz_localize(None)
master['Hour'] = master['kickoff_time'].dt.hour # Extract hour
master['DayOfWeek'] = master['kickoff_time'].dt.dayofweek # Extract day of week (Monday = 0 to Sunday = 6)
master['Weekend'] = master['DayOfWeek'].apply(lambda x: 1 if x >= 5 else 0) # Determine if weekend (1 if yes, 0 if no)
master['WeekOfYear'] = master['kickoff_time'].dt.isocalendar().week # Extract week of year
master['Month'] = master['kickoff_time'].dt.month # Extract month
master['Year'] = master['kickoff_time'].dt.year # Extract year
# Define the function to assign seasons
def assign_season(kickoff_time):
    if pd.Timestamp('2020-08-01') <= kickoff_time <= pd.Timestamp('2021-05-31'):
        return '2020-2021'
    elif pd.Timestamp('2021-08-01') <= kickoff_time <= pd.Timestamp('2022-05-31'):
        return '2021-2022'
    elif pd.Timestamp('2022-08-01') <= kickoff_time <= pd.Timestamp('2023-05-31'):
        return '2022-2023'
    elif pd.Timestamp('2023-08-01') <= kickoff_time <= pd.Timestamp('2024-05-31'):
        return '2023-2024'
    elif pd.Timestamp('2024-08-01') <= kickoff_time <= pd.Timestamp('2025-05-31'):
        return '2024-2025'
    else:
        return None  # If the date doesn't fall into any range
# Apply the function to create the 'season' column
master['Season'] = master['kickoff_time'].apply(assign_season)
master.head()
| | name | position | team | xP | assists | bonus | bps | clean_sheets | creativity | element | ... | position_GK | position_MID | team_label | Hour | DayOfWeek | Weekend | WeekOfYear | Month | Year | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Connolly | FWD | Brighton | 0.5 | 0 | 0 | -3 | 0 | 0.3 | 78 | ... | False | False | 4 | 19 | 0 | 0 | 38 | 9 | 2020 | 2020-2021 |
| 1 | Aaron Cresswell | DEF | West Ham | 2.1 | 0 | 0 | 11 | 0 | 11.2 | 435 | ... | False | False | 25 | 19 | 5 | 1 | 37 | 9 | 2020 | 2020-2021 |
| 2 | Aaron Mooy | MID | Brighton | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 60 | ... | False | True | 4 | 19 | 0 | 0 | 38 | 9 | 2020 | 2020-2021 |
| 3 | Aaron Ramsdale | GK | Sheffield Utd | 2.5 | 0 | 0 | 12 | 0 | 0.0 | 483 | ... | True | False | 20 | 17 | 0 | 0 | 38 | 9 | 2020 | 2020-2021 |
| 4 | Abdoulaye DoucourÃ© | MID | Everton | 1.3 | 0 | 0 | 20 | 1 | 44.6 | 512 | ... | False | True | 8 | 15 | 6 | 1 | 37 | 9 | 2020 | 2020-2021 |
5 rows × 51 columns
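The season assignment can also be derived arithmetically from the kickoff date instead of listing each season by hand. The sketch below assumes that a season starts on 1 August, so any date from January to July belongs to the season that began the previous year. The function name `season_label` is hypothetical. Unlike `assign_season`, it never returns `None`: a June or July date is attached to the preceding season rather than left unlabeled.

```python
import pandas as pd

def season_label(kickoff: pd.Series) -> pd.Series:
    """Map kickoff datetimes to labels such as '2020-2021'.

    Assumption: a season starts on 1 August, so months 1-7 belong
    to the season that started in the previous calendar year.
    """
    start_year = kickoff.dt.year.where(kickoff.dt.month >= 8, kickoff.dt.year - 1)
    return start_year.astype(str) + "-" + (start_year + 1).astype(str)

# Hypothetical kickoff times
kickoffs = pd.to_datetime(pd.Series([
    "2020-09-12 11:30",
    "2021-05-23 15:00",
    "2024-08-17 14:00",
]))
labels = season_label(kickoffs)
```

This version works on the whole column at once and needs no change when another season is appended.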
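Decoding Player Names into Conventional Alphabetical Format
Now, we should clean the 'name' column. Some names contain garbled character sequences (for example, 'DoucourÃ©' instead of 'Doucouré') because accented characters were decoded with the wrong encoding. We replace these sequences with the intended characters and also standardize a few names that are written differently across seasons.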
def remove_accents(df):
    ''' Replace recurring symbols with their alphanumeric counterparts '''
    df['name'] = df['name'].str.replace('Ã©', 'é', regex=False)
    df['name'] = df['name'].str.replace('Ã§', 'ç', regex=False)
    df['name'] = df['name'].str.replace('Ã\xad', 'í', regex=False)
    df['name'] = df['name'].str.replace('Ã³', 'ó', regex=False)
    df['name'] = df['name'].str.replace('Ã¶', 'ö', regex=False)
    df['name'] = df['name'].str.replace('Ã¼', 'ü', regex=False)
    df['name'] = df['name'].str.replace('Ã¤', 'ä', regex=False)
    df['name'] = df['name'].str.replace('Ã«', 'ë', regex=False)
    df['name'] = df['name'].str.replace('Ã£', 'ã', regex=False)
    df['name'] = df['name'].str.replace('Ä\x87', 'ć', regex=False)
    df['name'] = df['name'].str.replace('Ã\x98', 'O', regex=False)
    df['name'] = df['name'].str.replace('Å\x82', 'l', regex=False)
    # Deal with specific outlier names
    df['name'] = df['name'].str.replace('FernÃ¡ndez', 'Fernández', regex=False)
    df['name'] = df['name'].str.replace('Marek RodÃ¡k', 'Marek Rodák', regex=False)
    df['name'] = df['name'].str.replace('GroÃ\x9f', 'Groß', regex=False)
    df['name'] = df['name'].str.replace('Davinson SÃ¡nchez', 'Davinson Sánchez', regex=False)
    df['name'] = df['name'].str.replace('Cengiz Ã\x9cnder', 'Cengiz Under', regex=False)
    df['name'] = df['name'].str.replace('FabiÃ¡n Balbuena', 'Fabián Balbuena', regex=False)
    df['name'] = df['name'].str.replace('Robert SÃ¡nchez', 'Robert Sánchez', regex=False)
    df['name'] = df['name'].str.replace('SaÃºl Ã\x91iguez', 'Saúl Níguez', regex=False)
    df['name'] = df['name'].str.replace('Ã\x81lvaro', 'Alvaro', regex=False)
    df['name'] = df['name'].str.replace('Son Heung-min', 'Heung-Min Son', regex=False)
    df['name'] = df['name'].str.replace('AdriÃ¡n San Miguel del Castillo', 'Adrián San Miguel del Castillo', regex=False)
    df['name'] = df['name'].str.replace('Ã\x96zil', 'Ozil', regex=False)
    df['name'] = df['name'].str.replace('Ã\x87aglar', 'Caglar', regex=False)
    df['name'] = df['name'].str.replace('AdriÃ¡n Bernabé', 'Adrián Bernabé', regex=False)
    df['name'] = df['name'].str.replace('NicolÃ¡s Otamendi', 'Nicolás Otamendi', regex=False)
    df['name'] = df['name'].str.replace('Thiago AlcÃ¡ntara', 'Thiago Alcántara', regex=False)
    df['name'] = df['name'].str.replace('SaÃ¯d Benrahma', 'Said Benrahma', regex=False)
    df['name'] = df['name'].str.replace('ImrÃ¢n', 'Imran', regex=False)
    df['name'] = df['name'].str.replace('DerviÅ\x9foÄ\x9flu', 'Dervişoğlu', regex=False)
    df['name'] = df['name'].str.replace('Francisco Jorge TomÃ¡s Oliveira', 'Francisco Jorge Tomás Oliveira', regex=False)
    df['name'] = df['name'].str.replace('Benjamin Chilwell', 'Ben Chilwell', regex=False)
    df['name'] = df['name'].str.replace('Emiliano Martínez Romero', 'Emiliano Martínez', regex=False)
    df['name'] = df['name'].str.replace('Gabriel dos Santos Magalhães', 'Gabriel Magalhães', regex=False)
    df['name'] = df['name'].str.replace('Gabriel Teodoro Martinelli Silva', 'Gabriel Martinelli', regex=False)
    df['name'] = df['name'].str.replace('Gabriel Martinelli Silva', 'Gabriel Martinelli', regex=False)
    df['name'] = df['name'].str.replace('Joelinton CÃ¡ssio ApolinÃ¡rio de Lira', 'Joelinton', regex=False)
    df['name'] = df['name'].str.replace('Matteo Guendouzi', 'Mattéo Guendouzi', regex=False)
    df['name'] = df['name'].str.replace('Romain SaÃ¯ss', 'Romain Saïss', regex=False)
    df['name'] = df['name'].str.replace('Pablo HernÃ¡ndez Domínguez', 'Pablo Hernández Domínguez', regex=False)
    df['name'] = df['name'].str.replace('RÃºben Diogo da Silva Neves', 'Rúben da Silva Neves', regex=False)
    df['name'] = df['name'].str.replace('Paulo Gazzaniga Farias', 'Paulo Gazzaniga', regex=False)
    df['name'] = df['name'].str.replace('Tanguy Ndombélé Alvaro', 'Tanguy Ndombele', regex=False)
    df['name'] = df['name'].str.replace('Bruno Borges Fernandes', 'Bruno Fernandes', regex=False)
    df['name'] = df['name'].str.replace('Bruno Miguel Borges Fernandes', 'Bruno Fernandes', regex=False)
    return df
master = remove_accents(master)
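Garbling of this kind usually arises when UTF-8 bytes are decoded as Latin-1, and in that case it can often be reversed generically by re-encoding to Latin-1 and decoding as UTF-8. The sketch below is an alternative to the replacement table, not the method used in this notebook. It assumes the damage is a single Latin-1 misdecode, and the helper name `fix_mojibake` and the sample names are illustrative. Strings that do not round-trip are returned unchanged, so names that are already correct are left alone.

```python
import pandas as pd

def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Assumption: the damage is a single Latin-1 misdecode. Anything that
    cannot be round-tripped is returned as-is.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Hypothetical sample of raw names, including one plain-ASCII name
names = pd.Series(["Abdoulaye DoucourÃ©", "Aaron Mooy", "Pascal GroÃ\x9f"])
repaired = names.map(fix_mojibake)
```

A generic repair like this does not cover the name standardizations in `remove_accents` (for example, shortening 'Bruno Miguel Borges Fernandes' to 'Bruno Fernandes'), so an explicit mapping would still be needed for those.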
Duplicates
Let's also examine the data for the presence of duplicate rows.
print(f"Number of duplicate rows in the master data set: {master.duplicated().sum()}")
Number of duplicate rows in the master data set: 0
Nice, no duplicates! Also, this makes sense, since each row in our data represents a unique player in a unique gameweek in a unique fixture, so duplicates would have indicated errors in data sourcing.
Missing Values / NaNs
First, let's explore how many missing values exist in the master dataframe:
# Determine missing values for each column
missing_values = master.isnull().sum()
# Create a df of missing values
missing_df = pd.DataFrame({'Missing Values': missing_values})
# Show all rows
pd.set_option('display.max_rows', None) # Show all rows
missing_df
| | Missing Values |
|---|---|
| name | 0 |
| position | 0 |
| team | 0 |
| xP | 0 |
| assists | 0 |
| bonus | 0 |
| bps | 0 |
| clean_sheets | 0 |
| creativity | 0 |
| element | 0 |
| fixture | 0 |
| goals_conceded | 0 |
| goals_scored | 0 |
| Influence_Creativity_Threat_Index | 0 |
| influence | 0 |
| kickoff_time | 0 |
| minutes | 0 |
| opponent_team | 0 |
| own_goals | 0 |
| penalties_missed | 0 |
| penalties_saved | 0 |
| red_cards | 0 |
| round | 0 |
| saves | 0 |
| selected | 0 |
| team_a_score | 0 |
| team_h_score | 0 |
| threat | 0 |
| total_points | 0 |
| transfers_balance | 0 |
| transfers_in | 0 |
| transfers_out | 0 |
| value | 0 |
| was_home | 0 |
| yellow_cards | 0 |
| GW | 0 |
| expected_goals | 32648 |
| expected_assists | 32648 |
| expected_goal_involvements | 32648 |
| position_DEF | 0 |
| position_FWD | 0 |
| position_GK | 0 |
| position_MID | 0 |
| team_label | 0 |
| Hour | 0 |
| DayOfWeek | 0 |
| Weekend | 0 |
| WeekOfYear | 0 |
| Month | 0 |
| Year | 0 |
| Season | 0 |
So, we see that our efforts to impute/drop missing values will have to focus on three main features: 'expected_goals', 'expected_assists', and 'expected_goal_involvements'. We will go ahead and impute these missing values with the mean of that player's xG/xA/xGI for that specific season, to provide a temporally appropriate context for the substitution. If no values exist in that season, we will impute with the mean of that player's xG/xA/xGI across all seasons.
# Function to impute missing values for grouped data
def impute_mean_per_group(df, group_cols):
    # Identify columns with missing values
    missing_columns = df.columns[df.isnull().any()]
    for col in missing_columns:
        # Step 1: Impute missing values using the mean for each group (name, season)
        df[col] = df.groupby(group_cols)[col].transform(
            lambda group: group.fillna(group.mean())
        )
        # Step 2: Handle edge cases where the group mean couldn't be calculated
        # Fall back to mean for the player across all seasons
        df[col] = df.groupby('name')[col].transform(
            lambda group: group.fillna(group.mean())
        )
    return df
# Apply the imputation
master = impute_mean_per_group(master, ['name', 'Season'])
# Determine missing values for each column
missing_values = master.isnull().sum()
# Create a df of missing values
missing_df = pd.DataFrame({'Missing Values': missing_values})
# Show all rows
pd.set_option('display.max_rows', None) # Show all rows
missing_df
| | Missing Values |
|---|---|
| name | 0 |
| position | 0 |
| team | 0 |
| xP | 0 |
| assists | 0 |
| bonus | 0 |
| bps | 0 |
| clean_sheets | 0 |
| creativity | 0 |
| element | 0 |
| fixture | 0 |
| goals_conceded | 0 |
| goals_scored | 0 |
| Influence_Creativity_Threat_Index | 0 |
| influence | 0 |
| kickoff_time | 0 |
| minutes | 0 |
| opponent_team | 0 |
| own_goals | 0 |
| penalties_missed | 0 |
| penalties_saved | 0 |
| red_cards | 0 |
| round | 0 |
| saves | 0 |
| selected | 0 |
| team_a_score | 0 |
| team_h_score | 0 |
| threat | 0 |
| total_points | 0 |
| transfers_balance | 0 |
| transfers_in | 0 |
| transfers_out | 0 |
| value | 0 |
| was_home | 0 |
| yellow_cards | 0 |
| GW | 0 |
| expected_goals | 9772 |
| expected_assists | 9772 |
| expected_goal_involvements | 9772 |
| position_DEF | 0 |
| position_FWD | 0 |
| position_GK | 0 |
| position_MID | 0 |
| team_label | 0 |
| Hour | 0 |
| DayOfWeek | 0 |
| Weekend | 0 |
| WeekOfYear | 0 |
| Month | 0 |
| Year | 0 |
| Season | 0 |
As we can see, we still have almost 10,000 missing values in each of 'expected_goals', 'expected_assists', and 'expected_goal_involvements'. These exist even after trying to impute based on both season/player context and player context. Therefore, we will go ahead and drop the rows with the remaining missing values.
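To see why some values survive both imputation passes, consider a toy example (hypothetical players 'A' and 'B'). Player 'A' has one observed value, which fills the rest of that season and then, through the player-level fallback, the other season. Player 'B' has no observed value at all, so both group means are undefined and the rows stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'A' has one observed xG value, 'B' has none
toy = pd.DataFrame({
    "name":   ["A", "A", "A", "A", "B", "B"],
    "Season": ["2020-2021", "2020-2021", "2021-2022", "2021-2022",
               "2020-2021", "2020-2021"],
    "expected_goals": [0.2, np.nan, np.nan, np.nan, np.nan, np.nan],
})

col = "expected_goals"

# Pass 1: fill with the mean of the same player in the same season
toy[col] = toy[col].fillna(toy.groupby(["name", "Season"])[col].transform("mean"))

# Pass 2: fall back to the player's mean across all seasons
toy[col] = toy[col].fillna(toy.groupby("name")[col].transform("mean"))

remaining = toy[col].isna().sum()
```

The built-in `"mean"` aggregation skips missing values, so a group with no observed values yields NaN and leaves its rows unfilled.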
# Drop every row that still contains a missing value in any column
master = master.dropna()
master.shape
(102148, 51)
Filtering Out Rows with Zero or Limited Minutes Played
Now, we will filter out rows where the number of minutes played, 'minutes', is zero. The original 'master' dataframe includes all players in a season, including those who are listed in a team's squad but do not play (those on the bench).
We should also filter out rows with limited 'minutes' of gameplay, because they tend to have incomplete or missing data. They may also bias our model and analysis by including late-game strategies (i.e., some managers might substitute a player in at the end of a game where his team is leading to increase defensive posture and maintain their lead). However, in order to avoid favoring early starters, we need to strike a balance in choosing the 'minutes' threshold.
We will proceed by filtering out rows in which the player had fewer than 5 minutes of gameplay.
# Print original length
print('Original Length of master dataframe: ', master.shape[0])
# Print the number of rows with minutes = 0
print(f"Number of rows with minutes = 0: {master[master['minutes'] == 0].shape[0]}")
# Print the number of rows with minutes between 0 and 5
print(f"Number of rows with minutes between 0 and 5: {master[(master['minutes'] > 0) & (master['minutes'] < 5)].shape[0]}")
Original Length of master dataframe:  102148
Number of rows with minutes = 0: 57408
Number of rows with minutes between 0 and 5: 1773
As we can see, a large portion of the database (57,408 rows, or 56.2%) represents players who were not active in a given fixture. By filtering these out, we can focus on players with concrete contributions when creating visualizations and designing our model. Rows with more than 0 but fewer than 5 minutes of gameplay will also be dropped; they account for a much smaller number of entries (1,773 rows).
master = master[master['minutes'] >= 5]
print('Length of master dataframe after filtering out players with 0-5 minutes of gameplay:', master.shape[0])
Length of master dataframe after filtering out players with 0-5 minutes of gameplay: 42967
Dealing with Invalid Rows
Now, let us examine invalid occurrences of expected goal metrics. If a player's 'goals_scored' is greater than zero, then that row's 'expected_goals' should not be zero. Therefore, we need to find the rows that violate this condition and replace their 'expected_goals' with the average of that player's non-zero 'expected_goals' in the same season. If the player has no such rows in that season, we fall back to the player's average across all seasons, and finally to the overall average.
# Filter rows where goals_scored > 0 and expected_goals == 0
invalid_rows = master[(master['goals_scored'] > 0) & (master['expected_goals'] == 0)]
# Count how many times this happens
invalid_count = len(invalid_rows)
print(f"Number of rows where goals_scored > 0 but expected_goals = 0: {invalid_count}")
Number of rows where goals_scored > 0 but expected_goals = 0: 373
# Impute expected_goals with the mean for that player and season,
# with fallback to player-level or global mean
def impute_expected_goals(row):
    if row['goals_scored'] > 0 and row['expected_goals'] == 0:
        # Calculate the mean of expected_goals for the player and season
        season_mean = master[
            (master['name'] == row['name']) &
            (master['Season'] == row['Season']) &
            (master['expected_goals'] > 0)
        ]['expected_goals'].mean()
        # Fallback to the mean for the player across all seasons
        if pd.isna(season_mean):
            player_mean = master[
                (master['name'] == row['name']) &
                (master['expected_goals'] > 0)
            ]['expected_goals'].mean()
            return player_mean if pd.notna(player_mean) else master['expected_goals'].mean()
        return season_mean
    else:
        return row['expected_goals']  # Leave unchanged
# Apply the imputation
master['expected_goals'] = master.apply(impute_expected_goals, axis=1)
# Verify the result
rows_with_condition = master[(master['goals_scored'] > 0) & (master['expected_goals'] == 0)]
print(f"Number of rows where goals_scored > 0 but expected_goals = 0: {len(rows_with_condition)}")
Number of rows where goals_scored > 0 but expected_goals = 0: 0
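The row-wise `apply` above filters the whole dataframe once per row, which can be slow on tens of thousands of rows. The same fallback chain (season mean, then player mean, then global mean) can be sketched with grouped transforms. This is an illustrative variation on toy data, and the variable names (`invalid`, `positive_xg`, `fallback`) are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'A' has valid xG rows to borrow from, 'B' has none
toy = pd.DataFrame({
    "name":   ["A", "A", "A", "B"],
    "Season": ["2020-2021"] * 4,
    "goals_scored":   [1, 0, 1, 1],
    "expected_goals": [0.0, 0.4, 0.6, 0.0],
})

# Rows that scored but report zero expected goals
invalid = (toy["goals_scored"] > 0) & (toy["expected_goals"] == 0)

# Only strictly positive xG values contribute to the group means
positive_xg = toy["expected_goals"].where(toy["expected_goals"] > 0)
season_mean = positive_xg.groupby([toy["name"], toy["Season"]]).transform("mean")
player_mean = positive_xg.groupby(toy["name"]).transform("mean")
global_mean = toy["expected_goals"].mean()  # includes zeros, as in the row-wise version

# Fallback chain: season mean -> player mean -> global mean
fallback = season_mean.fillna(player_mean).fillna(global_mean)
toy.loc[invalid, "expected_goals"] = fallback[invalid]
```

All means are computed before any value is overwritten, which matches the row-wise version, where `master` is only reassigned after `apply` has finished.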
Negative Values
Now, we need to check for negative values.
for column in master.columns:
    # Check if the column is numeric
    if master[column].dtype in ['int64', 'float64']:
        # Filter rows with negative values
        negatives = master[master[column] < 0]
        if not negatives.empty:
            print(f"Negative values found in column '{column}':")
            print(len(negatives))
            print("\n")
Negative values found in column 'xP':
1621

Negative values found in column 'bps':
1490

Negative values found in column 'total_points':
502

Negative values found in column 'transfers_balance':
21739
'total_points', 'bps', and 'transfers_balance' can have negative values. Players can be penalized for events like own goals, red cards, and goals conceded, so those negatives can and should be retained. In addition, 'transfers_balance' is a net figure that represents transfers in minus transfers out, so no issue with negatives here either.
However, 'xP' values are generally expected to be non-negative because they are probabilities multiplied by point weights, so negative values can indicate errors in data sourcing. Each negative 'xP' will therefore be replaced with the average of that player's valid 'xP' values from the gameweeks before and after it. The intent is to avoid jumping between seasons, since the 'master' dataframe is a concatenation of several seasons. If a valid value exists on only one side, that value is used on its own, and if neither side has one, the closest valid value for the player is used.
# Replace negative xP values with NaN
master.loc[master['xP'] < 0, 'xP'] = np.nan
# Function to impute xP
def impute_xp(df):
    # Iterate through each player's data
    for name, group in df.groupby('name'):
        # Loop through rows with NaN in xP
        for idx in group[group['xP'].isna()].index:
            current_gw = df.loc[idx, 'GW']
            # Find valid xP rows for this player in earlier and later gameweeks
            # (the group is per player, so 'Season' is not filtered here)
            previous_idx = group[
                (group['GW'] < current_gw) & (~group['xP'].isna())
            ].index.max()
            next_idx = group[
                (group['GW'] > current_gw) & (~group['xP'].isna())
            ].index.min()
            if pd.notna(previous_idx) and pd.notna(next_idx):
                # Average of the previous and next valid xP values
                df.loc[idx, 'xP'] = (df.loc[previous_idx, 'xP'] + df.loc[next_idx, 'xP']) / 2
            elif pd.notna(previous_idx):
                # Use the previous valid xP value
                df.loc[idx, 'xP'] = df.loc[previous_idx, 'xP']
            elif pd.notna(next_idx):
                # Use the next valid xP value
                df.loc[idx, 'xP'] = df.loc[next_idx, 'xP']
            else:
                # Fallback: Use the closest available xP value
                neighbor_idx = group[~group['xP'].isna()].index.min()
                if pd.notna(neighbor_idx):
                    df.loc[idx, 'xP'] = df.loc[neighbor_idx, 'xP']
    return df
# Apply the imputation function
master = impute_xp(master)
# Verify the result
print(len(master[master['xP'].isna()])) # Should be empty if all NaNs are imputed
29
The remaining NaNs are likely due to zero applicable values to impute with, based on our imputation conditions. Therefore, let's go ahead and drop these 29 rows.
print(master.shape)
master = master.dropna()
print(master.shape)
(42967, 51)
(42938, 51)
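The neighbor-based imputation can also be sketched without explicit loops by forward- and backward-filling within each player and season. Note that this is a variation, not a transcription of `impute_xp`: it groups by both 'name' and 'Season', so values never cross a season boundary, and it always uses the nearest valid gameweek on each side. The toy data and names (`prev_valid`, `next_valid`, `neighbour_avg`) are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with negative xP values to be replaced
toy = pd.DataFrame({
    "name":   ["A"] * 4 + ["B"] * 2,
    "Season": ["2020-2021"] * 6,
    "GW":     [1, 2, 3, 4, 1, 2],
    "xP":     [2.0, -1.0, 4.0, -0.5, -2.0, 3.0],
})

# Treat negative xP as missing
toy["xP"] = toy["xP"].mask(toy["xP"] < 0)

# Order rows so that fills follow gameweek order within each player-season
toy = toy.sort_values(["name", "Season", "GW"])
grouped = toy.groupby(["name", "Season"])["xP"]
prev_valid = grouped.ffill()  # nearest valid value from an earlier gameweek
next_valid = grouped.bfill()  # nearest valid value from a later gameweek

# mean(axis=1) skips NaN, so a single available neighbour is used on its own
neighbour_avg = pd.concat([prev_valid, next_valid], axis=1, keys=["prev", "next"]).mean(axis=1)
toy["xP"] = toy["xP"].fillna(neighbour_avg)
```

Rows remain missing only when a player has no valid 'xP' in the whole season, which is a stricter condition than in the per-player loop and would leave more rows to drop.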
Adding Cumulative and Combination Features
Let's also go ahead and add some cumulative/combined metric columns to our database. In particular, let's add goals per minute ('gpm') and assists per minute ('apm'), followed by the season-to-date columns 'cumulative_goals', 'cumulative_assists', 'cumulative_xG', 'cumulative_xA', 'cumulative_xGI', 'cumulative_gpm', 'cumulative_apm', 'cumulative_xP', 'cumulative_points', and 'cumulative_minutes'.
master['gpm'] = master['goals_scored']/master['minutes'] # Create a column for goals per minute
master['apm'] = master['assists']/master['minutes'] # Create a column for assists per minute
# Ensure the 'master' DataFrame is sorted by 'season', 'name', and 'kickoff_time'
master = master.sort_values(by=['Season', 'name', 'kickoff_time'])
# Define a function to calculate cumulative metrics for each season
def calculate_cumulative_metrics(group):
    group['cumulative_goals'] = group['goals_scored'].cumsum()
    group['cumulative_assists'] = group['assists'].cumsum()
    group['cumulative_xG'] = group['expected_goals'].cumsum()
    group['cumulative_xA'] = group['expected_assists'].cumsum()
    group['cumulative_xGI'] = group['expected_goal_involvements'].cumsum()
    group['cumulative_gpm'] = group['cumulative_goals'] / group['minutes'].cumsum()
    group['cumulative_apm'] = group['cumulative_assists'] / group['minutes'].cumsum()
    group['cumulative_xP'] = group['xP'].cumsum()
    group['cumulative_points'] = group['total_points'].cumsum()
    group['cumulative_minutes'] = group['minutes'].cumsum()
    return group
# Group by both 'season' and 'name', then apply the function
master = master.groupby(['Season', 'name']).apply(calculate_cumulative_metrics)
# Reset the index
master.reset_index(drop=True, inplace=True)
# Display the head
master.head()
| | name | position | team | xP | assists | bonus | bps | clean_sheets | creativity | element | ... | cumulative_goals | cumulative_assists | cumulative_xG | cumulative_xA | cumulative_xGI | cumulative_gpm | cumulative_apm | cumulative_xP | cumulative_points | cumulative_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Connolly | FWD | Brighton | 0.5 | 0 | 0 | -3 | 0 | 0.3 | 78 | ... | 0 | 0 | 0.392763 | 0.000000 | 0.392763 | 0.000000 | 0.000000 | 0.5 | 1 | 45 |
| 1 | Aaron Connolly | FWD | Brighton | 4.0 | 0 | 2 | 27 | 1 | 11.3 | 78 | ... | 1 | 0 | 0.554273 | 0.016604 | 0.570877 | 0.007463 | 0.000000 | 4.5 | 9 | 134 |
| 2 | Aaron Connolly | FWD | Brighton | 2.7 | 0 | 0 | 2 | 0 | 12.1 | 78 | ... | 1 | 0 | 0.586928 | 0.057287 | 0.644215 | 0.004831 | 0.000000 | 7.2 | 11 | 207 |
| 3 | Aaron Connolly | FWD | Brighton | 2.7 | 0 | 0 | 7 | 0 | 0.3 | 78 | ... | 1 | 0 | 0.586928 | 0.057287 | 0.644215 | 0.003676 | 0.000000 | 9.9 | 13 | 272 |
| 4 | Aaron Connolly | FWD | Brighton | 3.0 | 1 | 0 | 13 | 0 | 10.3 | 78 | ... | 1 | 1 | 0.586928 | 0.109529 | 0.696457 | 0.003521 | 0.003521 | 12.9 | 17 | 284 |
5 rows × 63 columns
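The per-minute ratios above divide by minutes played, which can be 0. pandas returns NaN for 0/0 and inf for x/0 rather than raising an error, so it may help to make that case explicit. The following is a minimal sketch on a made-up toy frame (`toy` and its values are illustrative assumptions, not the real `master` data). It also uses `groupby(...).cumsum()` directly as one possible alternative to `groupby(...).apply(...)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `master`; all values are invented for illustration.
toy = pd.DataFrame({
    "Season": ["2020-2021"] * 3 + ["2021-2022"] * 2,
    "name": ["A"] * 5,
    "kickoff_time": pd.to_datetime([
        "2020-09-12", "2020-09-19", "2020-09-26", "2021-08-14", "2021-08-21",
    ]),
    "goals_scored": [0, 1, 0, 2, 0],
    "minutes": [0, 90, 45, 90, 0],
})

# Cumulative sums only make sense in chronological order within each group.
toy = toy.sort_values(["Season", "name", "kickoff_time"])
grouped = toy.groupby(["Season", "name"])

# Running totals restart for every (Season, name) pair.
toy["cumulative_goals"] = grouped["goals_scored"].cumsum()
toy["cumulative_minutes"] = grouped["minutes"].cumsum()

# Replace 0 minutes with NaN so the ratio is explicitly undefined (NaN)
# until the player has actually been on the pitch.
toy["cumulative_gpm"] = (
    toy["cumulative_goals"] / toy["cumulative_minutes"].replace(0, np.nan)
)

print(toy[["Season", "cumulative_goals", "cumulative_minutes", "cumulative_gpm"]])
```

Whether NaN or 0 is the better placeholder for "no minutes yet" depends on how the column is used downstream; NaN is chosen here only because it keeps the undefined case visible.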
Converting Player 'value' Unit to Million GBP (£)
The unit for player 'value' is also 100,000s of GBP (£). For example, a 'value' of 50 is equivalent to £5 million. Therefore, let's go ahead and convert that column to units of millions of GBP.
master['value'] = master['value']/10
Saving a New, Cleaned CSV File
Now, finally, let's go ahead and save a cleaned master file to CSV format.
master.to_csv('../master_cleaned.csv', index=False)
Visualizations
First, let's import the cleaned/filtered data into a new DataFrame called 'master_cleaned'.
master_cleaned = pd.read_csv('../master_cleaned.csv')
master_cleaned.head()
| | name | position | team | xP | assists | bonus | bps | clean_sheets | creativity | element | ... | cumulative_goals | cumulative_assists | cumulative_xG | cumulative_xA | cumulative_xGI | cumulative_gpm | cumulative_apm | cumulative_xP | cumulative_points | cumulative_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Connolly | FWD | Brighton | 0.5 | 0 | 0 | -3 | 0 | 0.3 | 78 | ... | 0 | 0 | 0.392763 | 0.000000 | 0.392763 | 0.000000 | 0.000000 | 0.5 | 1 | 45 |
| 1 | Aaron Connolly | FWD | Brighton | 4.0 | 0 | 2 | 27 | 1 | 11.3 | 78 | ... | 1 | 0 | 0.554273 | 0.016604 | 0.570877 | 0.007463 | 0.000000 | 4.5 | 9 | 134 |
| 2 | Aaron Connolly | FWD | Brighton | 2.7 | 0 | 0 | 2 | 0 | 12.1 | 78 | ... | 1 | 0 | 0.586928 | 0.057287 | 0.644215 | 0.004831 | 0.000000 | 7.2 | 11 | 207 |
| 3 | Aaron Connolly | FWD | Brighton | 2.7 | 0 | 0 | 7 | 0 | 0.3 | 78 | ... | 1 | 0 | 0.586928 | 0.057287 | 0.644215 | 0.003676 | 0.000000 | 9.9 | 13 | 272 |
| 4 | Aaron Connolly | FWD | Brighton | 3.0 | 1 | 0 | 13 | 0 | 10.3 | 78 | ... | 1 | 1 | 0.586928 | 0.109529 | 0.696457 | 0.003521 | 0.003521 | 12.9 | 17 | 284 |
5 rows × 63 columns
master_cleaned.columns
Index(['name', 'position', 'team', 'xP', 'assists', 'bonus', 'bps',
'clean_sheets', 'creativity', 'element', 'fixture', 'goals_conceded',
'goals_scored', 'Influence_Creativity_Threat_Index', 'influence',
'kickoff_time', 'minutes', 'opponent_team', 'own_goals',
'penalties_missed', 'penalties_saved', 'red_cards', 'round', 'saves',
'selected', 'team_a_score', 'team_h_score', 'threat', 'total_points',
'transfers_balance', 'transfers_in', 'transfers_out', 'value',
'was_home', 'yellow_cards', 'GW', 'expected_goals', 'expected_assists',
'expected_goal_involvements', 'position_DEF', 'position_FWD',
'position_GK', 'position_MID', 'team_label', 'Hour', 'DayOfWeek',
'Weekend', 'WeekOfYear', 'Month', 'Year', 'Season', 'gpm', 'apm',
'cumulative_goals', 'cumulative_assists', 'cumulative_xG',
'cumulative_xA', 'cumulative_xGI', 'cumulative_gpm', 'cumulative_apm',
'cumulative_xP', 'cumulative_points', 'cumulative_minutes'],
dtype='object')
What makes Premier League performance intriguing? Why should we care about metrics like total points, home vs. away trends, and penalty impacts? By understanding these, we can better evaluate players' consistency, impact, and potential.
We start by looking at all the metrics available and how they correlate to total points.
Total points is an important metric to evaluate player performance as it aggregates key contributions such as goals, assists, clean sheets, and bonus points.
master_cleaned_copy = master_cleaned.copy()
numeric_data = master_cleaned_copy.select_dtypes(include=["float64", "int64"])
correlation_matrix = numeric_data.corr()
plt.figure(figsize=(15, 12))
sns.heatmap(correlation_matrix, annot=False, cmap="coolwarm", center=0, vmin=-1, vmax=1, square=True, linewidths=0.5)
plt.title("Heatmap of Variable Correlations (Collinearity Check)")
plt.show()
Here, we have an overview of the correlations between total points and various Fantasy Premier League metrics, providing a big-picture view of the relationships within the data. At first glance, we can see strong correlations between total points and metrics like BPS, influence, and expected goal involvements, which align with player performance expectations. However, interpreting this heatmap alone has its limitations. The complexity of interactions between variables and potential collinearity make it hard to draw specific, actionable conclusions for team selection.
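One way to follow up on the collinearity concern is to list the most strongly correlated feature pairs explicitly instead of reading them off the heatmap. The sketch below does this on synthetic data. The column names mirror the dataset, but the values, the random seed, and the 0.8 cutoff are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: xGI is built as xG + xA, so it is collinear with xG by construction.
rng = np.random.default_rng(0)
n = 1000
xg = rng.exponential(0.3, n)
xa = rng.exponential(0.1, n)
toy = pd.DataFrame({
    "expected_goals": xg,
    "expected_assists": xa,
    "expected_goal_involvements": xg + xa,
})

# Absolute correlations; keep only the upper triangle so each pair appears once
# and the diagonal (self-correlation of 1.0) is excluded.
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

threshold = 0.8  # assumed cutoff for "highly collinear"
collinear = pairs[pairs > threshold]
print(collinear)
```

Applied to `numeric_data`, the same pattern would give a ranked list of candidate redundant features, which is easier to act on than a 60-column heatmap.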
To uncover deeper insights, we need to break this down further by analyzing metrics specific to player positions.
# Position-wise performance
sns.boxplot(x='position', y='total_points', data=master_cleaned, palette='Set3')
plt.title("Position-wise Distribution of Total Points")
plt.xlabel("Position")
plt.ylabel("Total Points")
plt.show()
This plot reveals the spread of points by players in different positions. The results showcase the variability within each position.
Midfielders (MID) demonstrate the widest spread of points and the highest potential for top performance (indicated by outliers). Defenders (DEF) and Goalkeepers (GK) have tighter distributions, reflecting more consistent, yet limited scoring opportunities. Forwards (FWD) have high outliers due to exceptional performances.
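The spread described above can also be put into numbers. The following is a small sketch, on invented toy values rather than the real data, that summarizes each position with its median, interquartile range (IQR), and maximum. The helper `iqr` is a hypothetical name introduced here.

```python
import pandas as pd

# Toy per-match points; values are made up for illustration.
toy = pd.DataFrame({
    "position": ["GK"] * 4 + ["MID"] * 4,
    "total_points": [1, 2, 2, 3, 0, 2, 2, 12],
})

def iqr(series):
    """Interquartile range: the width of the box in a boxplot."""
    return series.quantile(0.75) - series.quantile(0.25)

# Named aggregation gives one row per position and one column per statistic.
summary = toy.groupby("position")["total_points"].agg(
    median="median", iqr=iqr, max="max"
)
print(summary)
```

In this toy example both positions share the same median, but MID has a wider IQR and a much larger maximum, which is the kind of pattern the boxplot is meant to reveal.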
plt.figure(figsize=(10, 6))
sns.kdeplot(data=master_cleaned, x='total_points', hue='position', fill=True, alpha=0.6)
plt.title("Points Per Match Distribution\nGrouped by Player Position", fontsize=16)
plt.xlabel("Points Per Match", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.show()
This plot shows the distribution of points per match for players grouped by position. Midfielders have the highest density peak, which suggests consistent returns. Defenders and forwards have similar distributions to each other but with a wider spread, which suggests more variable returns. There appears to be a second distinct peak for defenders, which might correspond to full-backs (FBs), attacking defenders who have a higher chance of scoring. Goalkeepers have a distinct, narrow distribution that highlights their specialized role: they depend largely on clean sheets for returns and have negligible avenues to score points beyond that.
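One caveat when comparing curve heights: with `hue`, `sns.kdeplot` normalizes all groups jointly by default (`common_norm=True`), so a position with more rows draws a taller curve regardless of its shape. The sketch below illustrates the difference between joint and within-group normalization using plain frequency shares on invented toy data.

```python
import pandas as pd

# Toy data: MID has three times as many rows as GK (values are made up).
toy = pd.DataFrame({
    "position": ["MID"] * 6 + ["GK"] * 2,
    "total_points": [2, 2, 2, 1, 1, 6, 2, 6],
})

# Joint normalization: shares sum to 1 over the whole table,
# so larger groups get larger shares (analogous to common_norm=True).
joint = toy.value_counts(["position", "total_points"], normalize=True)

# Within-group normalization: shares sum to 1 inside each position
# (analogous to common_norm=False).
within = toy.groupby("position")["total_points"].value_counts(normalize=True)

print(joint.loc[("MID", 2)], joint.loc[("GK", 2)])
print(within.loc[("MID", 2)], within.loc[("GK", 2)])
```

Here half of each position's matches score 2 points, yet the joint shares differ by a factor of three. Passing `common_norm=False` to `sns.kdeplot` makes the curves comparable in shape.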
# Calculate the total points per season for each player
season_points = master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()
# Define the order of positions for plotting
position_order = ['GK', 'FWD', 'DEF', 'MID']
plt.figure(figsize=(10, 6))
sns.kdeplot(
    data=season_points,
    x='total_points',
    hue='position',
    hue_order=position_order, # Control the order of positions
    fill=False,
    common_norm=False,
    alpha=0.6
)
plt.title("Total Season Points Distribution\nGrouped by Player Position", fontsize=16)
plt.xlabel("Total Season Points", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.show()
This plot shows the distribution of total season points for players grouped by position. Each curve is a kernel density estimate (KDE) indicating how season totals are distributed within a position. For example, midfielders have a higher peak and a fatter tail, suggesting a broader range of high-scoring players. Goalkeepers appear to cluster around two distinct peaks, likely because goalkeepers from weaker teams converge at the first peak and those from the small group of elite teams converge at the second. Forwards also have the longest tail, indicating the presence of exceptional performers.
# Function to calculate total points per season for each player, filter top n players, and count by position
def get_top_players_and_count_by_position(master_cleaned, n=20):
    # Calculate total points for the season for each player
    season_points = (master_cleaned.groupby(['Season', 'name', 'position'])['total_points'].sum().reset_index().sort_values(by=['Season', 'total_points'], ascending=[True, False]))
    # Get the top n players per season
    top_players = season_points.groupby('Season').head(n)
    # Count the number of players by position for each season
    position_counts = top_players.groupby(['Season', 'position'])['name'].count().reset_index()
    position_counts.rename(columns={'name': 'count'}, inplace=True)
    return top_players, position_counts
# Generate the position_counts
_, position_counts = get_top_players_and_count_by_position(master_cleaned, n=50)
# Pivot the data for a clustered bar plot
pivot_data = position_counts.pivot(index='Season', columns='position', values='count')
# Create a clustered bar plot
fig, ax = plt.subplots(figsize=(12, 6))
pivot_data.plot(kind='bar', ax=ax, width=0.8)
# Add labels and title
ax.set_xlabel('Season', fontsize=12)
ax.set_ylabel('Count of Top Players', fontsize=12)
ax.set_title('Top Players by Position Across Seasons', fontsize=14)
ax.legend(title='Position', fontsize=10)
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This plot shows the number of players in the top 50 by total season points, broken down by position and season. Midfielders consistently dominate the top-player count across all seasons, followed by defenders and, in recent years, forwards, with goalkeepers recently having fewer top-performing players. The trend highlights positional disparities in top-player representation over time.
avg_points_per_match = master_cleaned.groupby(['Season', 'position'])['total_points'].mean().reset_index()
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_points_per_match, x='position', y='total_points', hue='Season')
plt.title("Average Points Per Match by Position Across Seasons", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Points Per Match", fontsize=12)
plt.legend(title="Season", loc='upper left')
plt.show()
This bar plot shows average points per match by position and season. Goalkeepers appear to be the most consistent in point returns, which suggests steady performance throughout the season, whereas other positions likely have higher weekly variance even though they score more cumulatively over the season. Defenders in particular appear to be struggling in recent years.
Similarly, midfielders appear not only to be consistent but also to have higher points ceilings, as they have the highest density at high season totals (the fat tail in the earlier KDE figure, "Total Season Points Distribution Grouped by Player Position").
total_season_points = master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()
N = 50 # N here represents top N players by position
top_n_players_season = (total_season_points.groupby(['Season', 'position']).apply(lambda group: group.nlargest(N, 'total_points')).reset_index(drop=True))
# Calculate average points per match (PPM) for the top N players by merging back with the original DataFrame
merged_top_n = master_cleaned.merge(top_n_players_season[['name', 'Season', 'position']], on=['name', 'Season', 'position'], how='inner')
avg_points_top_n = merged_top_n.groupby(['Season', 'position'])['total_points'].mean().reset_index() # average PPM for the top N players
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_points_top_n, x='position', y='total_points', hue='Season')
plt.title(f"Average Points Per Match (Top {N} Players by Position Each Season)", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Points Per Match", fontsize=12)
plt.legend(title="Season", loc='upper left')
plt.show()
This plot is similar to the one above but restricted to the top players in each position. The trend suggests that midfielders are the most likely to have high point-earning potential.
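The "top N per position, then average their per-match points" step can be sketched on a toy frame as follows. This version selects the top N with `sort_values` plus `groupby(...).head(N)` instead of `apply` with `nlargest`. It is offered as an equivalent alternative under the assumption that ties at the cutoff do not matter, since the two approaches may break ties differently. All values are invented.

```python
import pandas as pd

# Toy match-level data: three midfielders, two matches each (made-up values).
matches = pd.DataFrame({
    "name": ["A", "A", "B", "B", "C", "C"],
    "position": ["MID"] * 6,
    "Season": ["2023-2024"] * 6,
    "total_points": [10, 2, 5, 5, 1, 1],
})
N = 2
keys = ["name", "position", "Season"]

# Step 1: season total per player.
season_totals = matches.groupby(keys, as_index=False)["total_points"].sum()

# Step 2: top N players per (Season, position) by season total.
top_n = (
    season_totals.sort_values("total_points", ascending=False)
    .groupby(["Season", "position"])
    .head(N)
)

# Step 3: keep only the match rows of those players, then average per match.
merged = matches.merge(top_n[keys], on=keys, how="inner")
ppm = merged.groupby(["Season", "position"])["total_points"].mean()
print(ppm)
```

Players A and B make the cut with season totals of 12 and 10, and their four match rows average 5.5 points per match.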
total_season_points = master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index()
N = 10 # same selection as above, but with the top 10 players per position
top_n_players_season = (total_season_points.groupby(['Season', 'position']).apply(lambda group: group.nlargest(N, 'total_points')).reset_index(drop=True))
avg_total_points_top_n = top_n_players_season.groupby(['Season', 'position'])['total_points'].mean().reset_index() # average season total (not PPM) for the top N players
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_total_points_top_n, x='position', y='total_points', hue='Season')
plt.title(f"Average Total Season Points (Top {N} Players by Position Each Season)", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Total Season Points", fontsize=12)
plt.legend(title="Season", loc='upper left')
plt.show()
top_n_players_season
| | name | position | Season | total_points |
|---|---|---|---|---|
| 0 | Stuart Dallas | DEF | 2020-2021 | 171 |
| 1 | Andrew Robertson | DEF | 2020-2021 | 161 |
| 2 | Trent Alexander-Arnold | DEF | 2020-2021 | 160 |
| 3 | Aaron Cresswell | DEF | 2020-2021 | 153 |
| 4 | Aaron Wan-Bissaka | DEF | 2020-2021 | 144 |
| 5 | Ben Chilwell | DEF | 2020-2021 | 139 |
| 6 | Matt Targett | DEF | 2020-2021 | 138 |
| 7 | Lewis Dunk | DEF | 2020-2021 | 130 |
| 8 | John Stones | DEF | 2020-2021 | 128 |
| 9 | Tyrone Mings | DEF | 2020-2021 | 128 |
| 10 | Harry Kane | FWD | 2020-2021 | 242 |
| 11 | Patrick Bamford | FWD | 2020-2021 | 194 |
| 12 | Jamie Vardy | FWD | 2020-2021 | 187 |
| 13 | Ollie Watkins | FWD | 2020-2021 | 168 |
| 14 | Dominic Calvert-Lewin | FWD | 2020-2021 | 165 |
| 15 | Roberto Firmino | FWD | 2020-2021 | 141 |
| 16 | Chris Wood | FWD | 2020-2021 | 138 |
| 17 | Che Adams | FWD | 2020-2021 | 136 |
| 18 | Callum Wilson | FWD | 2020-2021 | 134 |
| 19 | Danny Ings | FWD | 2020-2021 | 131 |
| 20 | Emiliano Martínez | GK | 2020-2021 | 186 |
| 21 | Ederson Santana de Moraes | GK | 2020-2021 | 160 |
| 22 | Illan Meslier | GK | 2020-2021 | 154 |
| 23 | Hugo Lloris | GK | 2020-2021 | 149 |
| 24 | Nick Pope | GK | 2020-2021 | 144 |
| 25 | Alisson Ramses Becker | GK | 2020-2021 | 140 |
| 26 | Edouard Mendy | GK | 2020-2021 | 140 |
| 27 | Sam Johnstone | GK | 2020-2021 | 140 |
| 28 | Lukasz Fabianski | GK | 2020-2021 | 133 |
| 29 | Bernd Leno | GK | 2020-2021 | 131 |
| 30 | Bruno Fernandes | MID | 2020-2021 | 244 |
| 31 | Mohamed Salah | MID | 2020-2021 | 231 |
| 32 | Heung-Min Son | MID | 2020-2021 | 228 |
| 33 | Sadio Mané | MID | 2020-2021 | 176 |
| 34 | Marcus Rashford | MID | 2020-2021 | 174 |
| 35 | Jack Harrison | MID | 2020-2021 | 160 |
| 36 | Ilkay Gündogan | MID | 2020-2021 | 157 |
| 37 | James Ward-Prowse | MID | 2020-2021 | 156 |
| 38 | Raheem Sterling | MID | 2020-2021 | 154 |
| 39 | Matheus Pereira | MID | 2020-2021 | 153 |
| 40 | Trent Alexander-Arnold | DEF | 2021-2022 | 208 |
| 41 | Andrew Robertson | DEF | 2021-2022 | 186 |
| 42 | Virgil van Dijk | DEF | 2021-2022 | 183 |
| 43 | Joel Matip | DEF | 2021-2022 | 170 |
| 44 | Aymeric Laporte | DEF | 2021-2022 | 160 |
| 45 | Antonio Rüdiger | DEF | 2021-2022 | 150 |
| 46 | Matthew Cash | DEF | 2021-2022 | 147 |
| 47 | Gabriel Magalhães | DEF | 2021-2022 | 146 |
| 48 | Reece James | DEF | 2021-2022 | 140 |
| 49 | Conor Coady | DEF | 2021-2022 | 138 |
| 50 | Harry Kane | FWD | 2021-2022 | 192 |
| 51 | Cristiano Ronaldo dos Santos Aveiro | FWD | 2021-2022 | 159 |
| 52 | Teemu Pukki | FWD | 2021-2022 | 142 |
| 53 | Michail Antonio | FWD | 2021-2022 | 140 |
| 54 | Ivan Toney | FWD | 2021-2022 | 139 |
| 55 | Emmanuel Dennis | FWD | 2021-2022 | 134 |
| 56 | Jamie Vardy | FWD | 2021-2022 | 133 |
| 57 | Ollie Watkins | FWD | 2021-2022 | 131 |
| 58 | Richarlison de Andrade | FWD | 2021-2022 | 125 |
| 59 | Gabriel Fernando de Jesus | FWD | 2021-2022 | 119 |
| 60 | Alisson Ramses Becker | GK | 2021-2022 | 176 |
| 61 | Hugo Lloris | GK | 2021-2022 | 158 |
| 62 | Ederson Santana de Moraes | GK | 2021-2022 | 155 |
| 63 | Lukasz Fabianski | GK | 2021-2022 | 136 |
| 64 | Aaron Ramsdale | GK | 2021-2022 | 135 |
| 65 | David de Gea | GK | 2021-2022 | 132 |
| 66 | Kasper Schmeichel | GK | 2021-2022 | 131 |
| 67 | Edouard Mendy | GK | 2021-2022 | 130 |
| 68 | Nick Pope | GK | 2021-2022 | 130 |
| 69 | Emiliano Martínez | GK | 2021-2022 | 129 |
| 70 | Mohamed Salah | MID | 2021-2022 | 265 |
| 71 | Heung-Min Son | MID | 2021-2022 | 258 |
| 72 | Jarrod Bowen | MID | 2021-2022 | 206 |
| 73 | Kevin De Bruyne | MID | 2021-2022 | 196 |
| 74 | Sadio Mané | MID | 2021-2022 | 183 |
| 75 | James Maddison | MID | 2021-2022 | 181 |
| 76 | Bukayo Saka | MID | 2021-2022 | 179 |
| 77 | Diogo Jota | MID | 2021-2022 | 175 |
| 78 | Mason Mount | MID | 2021-2022 | 169 |
| 79 | Raheem Sterling | MID | 2021-2022 | 162 |
| 80 | Kieran Trippier | DEF | 2022-2023 | 198 |
| 81 | Benjamin White | DEF | 2022-2023 | 156 |
| 82 | Trent Alexander-Arnold | DEF | 2022-2023 | 155 |
| 83 | Gabriel Magalhães | DEF | 2022-2023 | 146 |
| 84 | Ben Mee | DEF | 2022-2023 | 143 |
| 85 | Fabian Schär | DEF | 2022-2023 | 139 |
| 86 | Tyrone Mings | DEF | 2022-2023 | 130 |
| 87 | Dan Burn | DEF | 2022-2023 | 129 |
| 88 | Sven Botman | DEF | 2022-2023 | 128 |
| 89 | Pervis Estupiñán | DEF | 2022-2023 | 127 |
| 90 | Erling Haaland | FWD | 2022-2023 | 272 |
| 91 | Harry Kane | FWD | 2022-2023 | 263 |
| 92 | Ivan Toney | FWD | 2022-2023 | 182 |
| 93 | Ollie Watkins | FWD | 2022-2023 | 175 |
| 94 | Callum Wilson | FWD | 2022-2023 | 157 |
| 95 | Bryan Mbeumo | FWD | 2022-2023 | 150 |
| 96 | Dominic Solanke | FWD | 2022-2023 | 130 |
| 97 | Gabriel Fernando de Jesus | FWD | 2022-2023 | 125 |
| 98 | Brennan Johnson | FWD | 2022-2023 | 122 |
| 99 | Aleksandar Mitrović | FWD | 2022-2023 | 107 |
| 100 | David Raya Martin | GK | 2022-2023 | 166 |
| 101 | Alisson Ramses Becker | GK | 2022-2023 | 162 |
| 102 | David De Gea Quintana | GK | 2022-2023 | 161 |
| 103 | Nick Pope | GK | 2022-2023 | 157 |
| 104 | José Malheiro de Sá | GK | 2022-2023 | 148 |
| 105 | Aaron Ramsdale | GK | 2022-2023 | 143 |
| 106 | Bernd Leno | GK | 2022-2023 | 142 |
| 107 | Emiliano Martínez | GK | 2022-2023 | 135 |
| 108 | Lukasz Fabianski | GK | 2022-2023 | 127 |
| 109 | Jordan Pickford | GK | 2022-2023 | 124 |
| 110 | Mohamed Salah | MID | 2022-2023 | 239 |
| 111 | Martin Ødegaard | MID | 2022-2023 | 212 |
| 112 | Marcus Rashford | MID | 2022-2023 | 205 |
| 113 | Bukayo Saka | MID | 2022-2023 | 202 |
| 114 | Gabriel Martinelli | MID | 2022-2023 | 198 |
| 115 | Kevin De Bruyne | MID | 2022-2023 | 183 |
| 116 | Bruno Fernandes | MID | 2022-2023 | 176 |
| 117 | Eberechi Eze | MID | 2022-2023 | 159 |
| 118 | Pascal Groß | MID | 2022-2023 | 159 |
| 119 | Miguel Almirón Rejala | MID | 2022-2023 | 158 |
| 120 | Benjamin White | DEF | 2023-2024 | 181 |
| 121 | William Saliba | DEF | 2023-2024 | 164 |
| 122 | Gabriel Magalhães | DEF | 2023-2024 | 148 |
| 123 | Pedro Porro | DEF | 2023-2024 | 136 |
| 124 | Jarrad Branthwaite | DEF | 2023-2024 | 124 |
| 125 | Fabian Schär | DEF | 2023-2024 | 123 |
| 126 | Joško Gvardiol | DEF | 2023-2024 | 123 |
| 127 | Kyle Walker | DEF | 2023-2024 | 123 |
| 128 | Trent Alexander-Arnold | DEF | 2023-2024 | 122 |
| 129 | Joachim Andersen | DEF | 2023-2024 | 121 |
| 130 | Ollie Watkins | FWD | 2023-2024 | 228 |
| 131 | Erling Haaland | FWD | 2023-2024 | 217 |
| 132 | Dominic Solanke | FWD | 2023-2024 | 175 |
| 133 | Alexander Isak | FWD | 2023-2024 | 172 |
| 134 | Jean-Philippe Mateta | FWD | 2023-2024 | 163 |
| 135 | Julián Álvarez | FWD | 2023-2024 | 157 |
| 136 | Carlton Morris | FWD | 2023-2024 | 146 |
| 137 | Nicolas Jackson | FWD | 2023-2024 | 142 |
| 138 | Matheus Santos Carneiro Da Cunha | FWD | 2023-2024 | 135 |
| 139 | Darwin Núñez Ribeiro | FWD | 2023-2024 | 131 |
| 140 | Jordan Pickford | GK | 2023-2024 | 153 |
| 141 | David Raya Martin | GK | 2023-2024 | 135 |
| 142 | André Onana | GK | 2023-2024 | 133 |
| 143 | Bernd Leno | GK | 2023-2024 | 133 |
| 144 | Mark Flekken | GK | 2023-2024 | 119 |
| 145 | Alphonse Areola | GK | 2023-2024 | 116 |
| 146 | Emiliano Martínez | GK | 2023-2024 | 115 |
| 147 | Ederson Santana de Moraes | GK | 2023-2024 | 112 |
| 148 | Guglielmo Vicario | GK | 2023-2024 | 112 |
| 149 | Norberto Murara Neto | GK | 2023-2024 | 110 |
| 150 | Cole Palmer | MID | 2023-2024 | 244 |
| 151 | Bukayo Saka | MID | 2023-2024 | 226 |
| 152 | Phil Foden | MID | 2023-2024 | 226 |
| 153 | Heung-Min Son | MID | 2023-2024 | 213 |
| 154 | Mohamed Salah | MID | 2023-2024 | 211 |
| 155 | Martin Ødegaard | MID | 2023-2024 | 186 |
| 156 | Anthony Gordon | MID | 2023-2024 | 183 |
| 157 | Jarrod Bowen | MID | 2023-2024 | 182 |
| 158 | Kai Havertz | MID | 2023-2024 | 180 |
| 159 | Bruno Fernandes | MID | 2023-2024 | 166 |
| 160 | Virgil van Dijk | DEF | 2024-2025 | 45 |
| 161 | Trent Alexander-Arnold | DEF | 2024-2025 | 44 |
| 162 | Joško Gvardiol | DEF | 2024-2025 | 42 |
| 163 | Gabriel Magalhães | DEF | 2024-2025 | 41 |
| 164 | Ibrahima Konaté | DEF | 2024-2025 | 39 |
| 165 | Diogo Dalot Teixeira | DEF | 2024-2025 | 37 |
| 166 | Lucas Digne | DEF | 2024-2025 | 35 |
| 167 | Cristian Romero | DEF | 2024-2025 | 32 |
| 168 | Ola Aina | DEF | 2024-2025 | 32 |
| 169 | Andrew Robertson | DEF | 2024-2025 | 31 |
| 170 | Erling Haaland | FWD | 2024-2025 | 75 |
| 171 | Chris Wood | FWD | 2024-2025 | 59 |
| 172 | Danny Welbeck | FWD | 2024-2025 | 57 |
| 173 | Nicolas Jackson | FWD | 2024-2025 | 57 |
| 174 | Ollie Watkins | FWD | 2024-2025 | 51 |
| 175 | Kai Havertz | FWD | 2024-2025 | 44 |
| 176 | Matheus Santos Carneiro Da Cunha | FWD | 2024-2025 | 41 |
| 177 | Raúl Jiménez | FWD | 2024-2025 | 41 |
| 178 | Yoane Wissa | FWD | 2024-2025 | 39 |
| 179 | Jamie Vardy | FWD | 2024-2025 | 38 |
| 180 | André Onana | GK | 2024-2025 | 42 |
| 181 | Matz Sels | GK | 2024-2025 | 42 |
| 182 | Robert Sánchez | GK | 2024-2025 | 39 |
| 183 | David Raya Martin | GK | 2024-2025 | 37 |
| 184 | Nick Pope | GK | 2024-2025 | 36 |
| 185 | Alisson Ramses Becker | GK | 2024-2025 | 35 |
| 186 | Jordan Pickford | GK | 2024-2025 | 33 |
| 187 | Dean Henderson | GK | 2024-2025 | 32 |
| 188 | Emiliano Martínez | GK | 2024-2025 | 30 |
| 189 | Ederson Santana de Moraes | GK | 2024-2025 | 29 |
| 190 | Mohamed Salah | MID | 2024-2025 | 84 |
| 191 | Cole Palmer | MID | 2024-2025 | 79 |
| 192 | Bryan Mbeumo | MID | 2024-2025 | 68 |
| 193 | Bukayo Saka | MID | 2024-2025 | 63 |
| 194 | Luis Díaz | MID | 2024-2025 | 60 |
| 195 | Dwight McNeil | MID | 2024-2025 | 49 |
| 196 | Noni Madueke | MID | 2024-2025 | 46 |
| 197 | James Maddison | MID | 2024-2025 | 45 |
| 198 | Jarrod Bowen | MID | 2024-2025 | 45 |
| 199 | Emile Smith Rowe | MID | 2024-2025 | 41 |
master_cleaned['value'].describe()
count    42938.000000
mean         5.421533
std          1.396305
min          3.600000
25%          4.500000
50%          5.000000
75%          5.700000
max         15.400000
Name: value, dtype: float64
The following sections look at key metrics relative to player positions.
master_cleaned_copy["kickoff_time"] = pd.to_datetime(master_cleaned_copy["kickoff_time"])
master_cleaned_copy["season"] = master_cleaned_copy["kickoff_time"].apply(lambda x: f"{x.year}/{x.year + 1}" if x.month >= 8 else f"{x.year - 1}/{x.year}")
# Map position codes to descriptive bin labels
position_bins = {
    "GK": "Goalkeepers",
    "DEF": "Defenders",
    "MID": "Midfielders",
    "FWD": "Forwards",
}
master_cleaned_copy["position_bin"] = master_cleaned_copy["position"].map(position_bins) # map positions to bins
# Group by season and position bin
grouped = master_cleaned_copy.groupby(["season", "position_bin"]).agg({"name": "count", "xP": "mean"}).rename(columns={"name": "player_count", "xP": "avg_xP"}).reset_index()
print(grouped)
       season position_bin  player_count    avg_xP
0   2020/2021    Defenders          3161  2.750854
1   2020/2021     Forwards          1244  3.021624
2   2020/2021  Goalkeepers           709  3.711142
3   2020/2021  Midfielders          4139  2.821889
4   2021/2022    Defenders          3194  3.100423
5   2021/2022     Forwards          1258  3.087043
6   2021/2022  Goalkeepers           724  3.933218
7   2021/2022  Midfielders          4236  3.079072
8   2022/2023    Defenders          3608  2.595926
9   2022/2023     Forwards          1384  2.982117
10  2022/2023  Goalkeepers           769  3.557802
11  2022/2023  Midfielders          5085  2.730610
12  2023/2024    Defenders          3678  2.354323
13  2023/2024     Forwards          1333  3.091860
14  2023/2024  Goalkeepers           772  3.086593
15  2023/2024  Midfielders          5009  2.740397
16  2024/2025    Defenders           856  2.287675
17  2024/2025     Forwards           290  3.080345
18  2024/2025  Goalkeepers           182  3.128846
19  2024/2025  Midfielders          1307  2.450956
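The season label above is derived from the kickoff date with the rule "August or later belongs to the season starting that year". The same rule can be written without a row-wise `apply`. The sketch below is a vectorized version on three made-up timestamps. It assumes, as the original lambda does, that no season in the data starts before August.

```python
import numpy as np
import pandas as pd

# Made-up kickoff times covering an autumn, a spring, and an August match.
kickoff = pd.Series(pd.to_datetime([
    "2020-09-12T11:30:00Z",
    "2021-05-23T15:00:00Z",
    "2023-08-11T19:00:00Z",
]))

# Matches from August onward start a new season; earlier months belong
# to the season that began in the previous calendar year.
start_year = pd.Series(
    np.where(kickoff.dt.month >= 8, kickoff.dt.year, kickoff.dt.year - 1),
    index=kickoff.index,
)
season = start_year.astype(str) + "/" + (start_year + 1).astype(str)
print(season.tolist())
```

Note that this produces labels such as `2020/2021`, whereas the existing `Season` column uses the form `2020-2021`, so the two columns should not be compared directly without harmonizing the separator.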
Now we will look at how different metrics align with total points for each position.
master_cleaned_copy["was_home"] = master_cleaned_copy["was_home"].apply(lambda x: 1 if x == True else 0) # encode `was_home` as 0/1
master_cleaned_copy = master_cleaned_copy.drop(columns=["name", "position", "team"]) # dropping categorical columns
target = "total_points"
threshold = 0.2 # setting the correlation threshold
for position, group in master_cleaned_copy.groupby("position_bin"):
    numeric_data = group.select_dtypes(include=["float64", "int64"])
    correlation = numeric_data.corr()[target].sort_values(ascending=False)
    correlation = correlation.drop(target)
    correlation = correlation[correlation.abs() > threshold] # filtering correlations by the threshold
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation.to_frame(), annot=True, cmap="coolwarm", fmt=".2f", cbar=True, yticklabels=correlation.index)
    plt.title(f"Correlations with {target} for {position}")
    plt.show()
Interesting: the variables most strongly correlated with total points differ across positions, which makes sense. Goalkeepers and defenders rely on clean sheets, and goalkeepers also rely on saves for points. Midfielders and forwards rely on goals and assists, with forwards showing a stronger correlation for goals scored and midfielders for creative playmaking.
Defensive Player Metrics
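The per-position filtering used above can be illustrated without plotting. The sketch below builds synthetic data in which goalkeepers' points depend on saves and forwards' points depend on goals, then keeps only correlations above the threshold. The data-generating numbers and the seed are assumptions chosen for illustration and do not reflect real FPL scoring.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
gk_saves = rng.poisson(3, n)
fwd_goals = rng.poisson(0.4, n)

# Synthetic match rows: each position's points are driven by a different metric.
toy = pd.concat([
    pd.DataFrame({
        "position_bin": "Goalkeepers", "saves": gk_saves, "goals_scored": 0,
        "total_points": 1 + gk_saves / 3 + rng.normal(0, 0.5, n),
    }),
    pd.DataFrame({
        "position_bin": "Forwards", "saves": 0, "goals_scored": fwd_goals,
        "total_points": 2 + 4 * fwd_goals + rng.normal(0, 1, n),
    }),
], ignore_index=True)

target = "total_points"
threshold = 0.2
strong = {}
for position, group in toy.groupby("position_bin"):
    corr = group.select_dtypes("number").corr()[target].drop(target)
    # Constant columns (e.g. goals for goalkeepers here) give NaN correlations,
    # which the comparison below silently filters out.
    strong[position] = corr[corr.abs() > threshold].sort_values(ascending=False)

for position, series in strong.items():
    print(position, series.round(2).to_dict())
```

A practical consequence is that a metric missing from one position's heatmap may be absent because it never varies for that position, not only because its correlation is weak.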
gk_def_data = master_cleaned[master_cleaned['position'].isin(['GK', 'DEF'])].copy() # copy so the new columns below are added to an independent DataFrame
# Calculate cumulative metrics
gk_def_data['cumulative_clean_sheets'] = gk_def_data.groupby(['name', 'Season'])['clean_sheets'].cumsum()
gk_def_data['cumulative_saves'] = gk_def_data.groupby(['name', 'Season'])['saves'].cumsum()
gk_def_data['cumulative_goals_conceded'] = gk_def_data.groupby(['name', 'Season'])['goals_conceded'].cumsum()
gk_def_data['cumulative_points'] = gk_def_data.groupby(['name', 'Season'])['total_points'].cumsum()
# Aggregate by Gameweek
def_data = gk_def_data[gk_def_data['position'] == 'DEF'].groupby('GW').sum(numeric_only=True)
gk_data = gk_def_data[gk_def_data['position'] == 'GK'].groupby('GW').sum(numeric_only=True)
# Primary axis for cumulative points (plt.subplots creates the figure, so no separate plt.figure call is needed)
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(def_data.index, def_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)
# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(def_data.index, def_data['cumulative_clean_sheets'], label='Clean Sheets', color='blue', marker='o', linestyle='--')
ax2.plot(def_data.index, def_data['cumulative_goals_conceded'] / 10, label='Goals Conceded (Divided by 10)', color='brown', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')
# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)
plt.title('Defenders: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
# Primary axis for cumulative points (plt.subplots creates the figure, so no separate plt.figure call is needed)
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(gk_data.index, gk_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)
# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(gk_data.index, gk_data['cumulative_clean_sheets'], label='Clean Sheets', color='blue', marker='o', linestyle='--')
ax2.plot(gk_data.index, gk_data['cumulative_goals_conceded'] / 10, label='Goals Conceded (Divided by 10)', color='brown', marker='o', linestyle='--')
ax2.plot(gk_data.index, gk_data['cumulative_saves'] / 10, label='Saves (Divided by 10)', color='salmon', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')
# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)
plt.title('Goalkeepers: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
These trends compare key defensive metrics for goalkeepers and defenders over the course of a season, emphasizing how clean sheets, goals conceded, saves, and total points evolve by gameweek. For defenders, the cumulative points (green line) show a consistent rise, aligning closely with clean sheets (blue line), while goals conceded (brown line) contribute less significantly to their overall points. For goalkeepers, the cumulative points (green line) also increase steadily but at lower rates (as expected, since goalkeepers do not have high point-earning potential) and show a more significant contribution from saves (salmon line) and clean sheets (blue line).
Offensive Player Metrics (Mids and Fwds)
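Because the aggregation above groups by `GW` only, each gameweek's value adds up every player's cumulative total across all seasons in the data. The 2024-2025 season is only partly present, so early gameweeks include more seasons than later ones. One possible way to separate the seasons is sketched below on invented toy values; the real columns would be used in the same way.

```python
import pandas as pd

# Toy data: one full two-gameweek season and one season with a single gameweek so far.
toy = pd.DataFrame({
    "Season": ["2023-2024"] * 4 + ["2024-2025"] * 2,
    "name": ["A", "A", "B", "B", "A", "B"],
    "GW": [1, 2, 1, 2, 1, 1],
    "total_points": [2, 6, 1, 1, 5, 2],
})

# Running total per player within each season, in gameweek order.
toy = toy.sort_values(["Season", "name", "GW"])
toy["cumulative_points"] = toy.groupby(["Season", "name"])["total_points"].cumsum()

# Sum over players per (Season, GW), then put seasons side by side as columns.
per_season = (
    toy.groupby(["Season", "GW"])["cumulative_points"].sum().unstack("Season")
)
print(per_season)
```

Gameweeks that have not been played yet in a season appear as NaN rather than silently lowering the combined total, and each column can be plotted as its own line.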
# Filter the data for midfielders (MID) and forwards (FWD)
mid_fwd_data = master_cleaned[master_cleaned['position'].isin(['MID', 'FWD'])].copy() # copy so the new columns below are added to an independent DataFrame
# Calculate cumulative metrics
mid_fwd_data['cumulative_goals_scored'] = mid_fwd_data.groupby(['name', 'Season'])['goals_scored'].cumsum()
mid_fwd_data['cumulative_assists'] = mid_fwd_data.groupby(['name', 'Season'])['assists'].cumsum()
mid_fwd_data['cumulative_points'] = mid_fwd_data.groupby(['name', 'Season'])['total_points'].cumsum()
# Aggregate by Gameweek
mid_data = mid_fwd_data[mid_fwd_data['position'] == 'MID'].groupby('GW').sum(numeric_only=True)
fwd_data = mid_fwd_data[mid_fwd_data['position'] == 'FWD'].groupby('GW').sum(numeric_only=True)
# Primary axis for cumulative points (plt.subplots creates the figure, so no separate plt.figure call is needed)
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(mid_data.index, mid_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)
# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(mid_data.index, mid_data['cumulative_goals_scored'], label='Goals Scored', color='blue', marker='o', linestyle='--')
ax2.plot(mid_data.index, mid_data['cumulative_assists'], label='Assists', color='orange', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')
# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)
plt.title('Midfielders: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
# Primary axis for cumulative points (plt.subplots creates the figure, so no separate plt.figure call is needed)
fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(fwd_data.index, fwd_data['cumulative_points'], label='Points', color='green', marker='^', linestyle='-')
ax1.set_xlabel('Gameweek', fontsize=14)
ax1.set_ylabel('Cumulative Points (Seasonal Total)', fontsize=14, color='green')
ax1.tick_params(axis='y', labelcolor='green')
ax1.grid(alpha=0.3)
# Secondary axis for other metrics
ax2 = ax1.twinx()
ax2.plot(fwd_data.index, fwd_data['cumulative_goals_scored'], label='Goals Scored', color='blue', marker='o', linestyle='--')
ax2.plot(fwd_data.index, fwd_data['cumulative_assists'], label='Assists', color='orange', marker='o', linestyle='--')
# ax2.plot(fwd_data.index, fwd_data['cumulative_threat'] / 100, label='Threat (Divided by 100)', color='red', marker='o', linestyle='--')
ax2.set_ylabel('Other Metrics (Scaled)', fontsize=14, color='black')
ax2.tick_params(axis='y', labelcolor='black')
# Legend and Title
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=12)
plt.title('Forwards: Seasonal Total Cumulative Metrics by Gameweek', fontsize=14)
plt.tight_layout()
plt.show()
These plots compare key offensive metrics for forwards and midfielders over the course of a season, showing how goals scored, assists, and total points evolve by gameweek. For forwards, cumulative points (green line) rise consistently and track goals scored (blue line) closely, while assists (orange line) contribute less to their overall points. For midfielders, cumulative points (green line) also increase steadily but reflect a more balanced contribution from goals scored (blue line) and assists (orange line). This underscores the dual role midfielders play in both scoring and creating opportunities.
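As a minimal sketch of what the per-player cumulative columns contain, the snippet below builds running totals from weekly goal counts using only the standard library. The player names and weekly values are hypothetical and are not taken from the dataset.

```python
from itertools import accumulate

# Hypothetical weekly goal counts per player (illustrative values only)
weekly_goals = {
    "Player A": [1, 0, 2, 1],
    "Player B": [0, 0, 1, 0],
}

# Running total per player, analogous to a grouped cumulative sum
cumulative_goals = {
    name: list(accumulate(goals)) for name, goals in weekly_goals.items()
}

for name, totals in cumulative_goals.items():
    print(name, totals)
```

Each list ends with the player's season total, which is why the cumulative curves in the plots can only stay flat or rise.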
The following visualizations in this category focus on the correlations between specific performance-related metrics.
# Filter and reorganize the dataset for relevant features
selected_columns = [
'total_points', 'goals_scored', 'assists', 'expected_goals', 'expected_assists',
'expected_goal_involvements', 'clean_sheets', 'minutes', 'penalties_missed',
'influence', 'creativity', 'threat', 'bps'
]
copy = master_cleaned[selected_columns]
# Compute the correlation matrix
correlation_matrix = copy.corr()
# Sort the matrix by correlation with 'total_points'
correlation_matrix = correlation_matrix.sort_values(by='total_points', ascending=False, axis=0)
correlation_matrix = correlation_matrix.sort_values(by='total_points', ascending=False, axis=1)
# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f',
vmin=-1, vmax=1, cbar_kws={"shrink": 0.8}, linewidths=0.5
)
plt.title("Correlation Matrix of Key Player Performance Metrics", fontsize=12)
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.yticks(fontsize=8)
plt.show()
The metric 'total_points' has strong correlations with 'bps' (bonus points system), 'goals_scored', and 'influence'. This indicates these are the primary drivers of a player's overall FPL performance.
Metrics like 'penalties_missed' show little to no correlation with total_points, suggesting they have a minimal impact.
Advanced metrics like 'expected_goal_involvements' and 'expected_goals' show strong relationships with 'goals_scored' and 'total_points', which supports using them as predictive features for future performance.
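For readers who want to see what a single cell of the heatmap represents, here is a small sketch of the Pearson correlation coefficient, which is the default method of pandas `DataFrame.corr()`. The per-match values are hypothetical and chosen only to illustrate a strong and a weak relationship.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-match values (not from the dataset)
total_points = [2, 6, 9, 13, 2]
bps = [10, 25, 32, 48, 8]
penalties_missed = [0, 0, 1, 0, 0]

print(f"points vs bps: {pearson(total_points, bps):.2f}")
print(f"points vs penalties_missed: {pearson(total_points, penalties_missed):.2f}")
```

The coefficient only measures linear association within the same rows, so a high value says the two columns move together, not that one causes the other.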
# Select key metrics for pairwise relationships
pairwise_metrics = ['total_points', 'bps', 'goals_scored', 'expected_goal_involvements']
sns.pairplot(master_cleaned[pairwise_metrics], kind='reg', diag_kind='kde', palette='coolwarm')
plt.suptitle("Key Pairwise Relationships Between Metrics", y=1.02, fontsize=16, fontweight='bold')
plt.show()
Insights from the pairwise relationships:
- Total points and BPS show a positive linear relationship. BPS is a strong indicator of total points as it reflects a player's overall contribution in a match (tackles, passes, etc.).
- Total points increase with goals scored, as goals directly contribute to a player's point tally.
- The relationship between total points and expected goal involvements is less linear, as not all points come from goal involvements (e.g. clean sheets and bonus points also contribute).
- BPS and expected goal involvements show some positive association, but not as strong as other indicators.
- Goals scored and expected goal involvements show a moderate linear correlation. Players who score more tend to have higher xG/xA metrics.
The graph below summarizes the main metrics we have analyzed and their correlation with our target variable, 'total_points'.
Category 2: Advanced Metrics (xG and xA)
This section introduces expected goals (xG) and expected assists (xA) as metrics that provide deeper insights by quantifying the quality of chances, helping to differentiate between sustainable performance and statistical anomalies.
xG measures the cumulative probability of scoring based on the quality of chances, while xA estimates the likelihood that a pass will lead to a goal.
xG and xA quantify the quality of chances created or taken, providing a reliable indicator of a player's underlying performance. They help identify players who are overperforming or underperforming relative to expectations.
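A short worked example may make the idea concrete. The sketch below assumes that each shot carries its own scoring probability and that a match xG is the sum of those probabilities. The shot values and the goal count are hypothetical.

```python
# Hypothetical per-shot scoring probabilities for one player in one match
# (illustrative values, not taken from the dataset)
shot_probabilities = [0.75, 0.25, 0.125]

# Match xG is the sum of the individual shot probabilities
match_xg = sum(shot_probabilities)

# Assumed number of goals the player actually scored in that match
goals_scored = 2

# Positive difference: more goals than the chances were "worth"
overperformance = goals_scored - match_xg

print(f"xG = {match_xg:.3f}, goals = {goals_scored}, goals - xG = {overperformance:+.3f}")
```

The same subtraction, applied to cumulative totals over a season, is what the goals-versus-xG curves visualize.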
import matplotlib.lines as mlines  # used below for the custom legend handles

def plot_multiple_players_xg_vs_goals(player_names, season):
    plt.figure(figsize=(12, 8))  # Define figure size
    color_palette = sns.color_palette("tab10", len(player_names))  # Generate distinct colors for each player
    for idx, player_name in enumerate(player_names):
        # Filter data for the player and season
        # (the season mask is taken from master_cleaned_copy, so both frames must share the same index)
        player_data = master_cleaned[(master_cleaned['name'].str.contains(player_name, case=False, na=False)) & (master_cleaned_copy['season'] == season)]
        player_data = player_data.sort_values(by='GW')  # Sorting by Gameweek
        # Assign a color for the player
        player_color = color_palette[idx]
        # Plot cumulative goals and xG for the player
        plt.plot(player_data['GW'], player_data['cumulative_goals'], label=f"{player_name} - Goals", color=player_color, linewidth=2)
        plt.plot(player_data['GW'], player_data['cumulative_xG'], label=f"{player_name} - xG", color=player_color, linestyle="--", linewidth=2)
    # Set plot title and labels
    plt.title(f"Goals vs Expected Goals ({season})", fontsize=16, fontweight='bold')
    plt.xlabel("Gameweek", fontsize=12)
    plt.ylabel("Cumulative Count", fontsize=12)
    # Custom legend to group goals and xG by color
    custom_lines = [
        mlines.Line2D([], [], color=color_palette[i], linewidth=2, label=f"{player_names[i]} - Goals") for i in range(len(player_names))
    ] + [
        mlines.Line2D([], [], color=color_palette[i], linestyle="--", linewidth=2, label=f"{player_names[i]} - xG") for i in range(len(player_names))
    ]
    plt.legend(handles=custom_lines, fontsize=10, loc='upper left', bbox_to_anchor=(1, 1))
    # Add grid and adjust layout
    plt.grid(alpha=0.5)
    plt.tight_layout()
    plt.show()
# Example usage
player_names = ["Erling Haaland", "Kai Havertz"] # List of player names
season = "2022/2023" # Season
plot_multiple_players_xg_vs_goals(player_names, season)
- This chart compares the cumulative goals and expected goals (xG) for Erling Haaland and Kai Havertz across the 2022/2023 season.
- Erling Haaland significantly exceeds his xG, showing exceptional finishing ability and efficiency, as his goals curve consistently outpaces his xG.
- Kai Havertz, however, lags behind his xG in the second half of the season, highlighting inefficiencies in converting chances.
- This analysis shows that comparing goals against xG separates players who outperform the quality of their chances from those who underperform it.
def plot_multiple_players_xa_vs_assists(player_names, season):
    plt.figure(figsize=(12, 8))  # Define figure size
    color_palette = sns.color_palette("tab10", len(player_names))  # Generate distinct colors for each player
    for idx, player_name in enumerate(player_names):
        # Filter data for the player and season
        # (the season mask is taken from master_cleaned_copy, so both frames must share the same index)
        player_data = master_cleaned[(master_cleaned['name'].str.contains(player_name, case=False, na=False)) & (master_cleaned_copy['season'] == season)]
        player_data = player_data.sort_values(by='GW')  # Sorting by Gameweek
        # Assign a color for the player
        player_color = color_palette[idx]
        # Plot cumulative assists and xA for the player
        plt.plot(player_data['GW'], player_data['cumulative_assists'], label=f"{player_name} - Assists", color=player_color, linewidth=2)
        plt.plot(player_data['GW'], player_data['cumulative_xA'], label=f"{player_name} - xA", color=player_color, linestyle="--", linewidth=2)
    # Set plot title and labels
    plt.title(f"Assists vs Expected Assists ({season})", fontsize=16, fontweight='bold')
    plt.xlabel("Gameweek", fontsize=12)
    plt.ylabel("Cumulative Count", fontsize=12)
    # Custom legend to group assists and xA by color
    custom_lines = [
        mlines.Line2D([], [], color=color_palette[i], linewidth=2, label=f"{player_names[i]} - Assists") for i in range(len(player_names))
    ] + [
        mlines.Line2D([], [], color=color_palette[i], linestyle="--", linewidth=2, label=f"{player_names[i]} - xA") for i in range(len(player_names))
    ]
    plt.legend(handles=custom_lines, fontsize=10, loc='upper left', bbox_to_anchor=(1, 1))
    # Add grid and adjust layout
    plt.grid(alpha=0.5)
    plt.tight_layout()
    plt.show()
# Example usage
player_names = ["Bukayo Saka", "Bruno Fernandes"] # List of player names
season = "2023/2024" # Season
plot_multiple_players_xa_vs_assists(player_names, season)
This chart compares Saka and Fernandes in the 2023/2024 season. Saka's assists exceed his xA, indicating overperformance, whereas Fernandes slightly underperforms relative to his xA.
Understanding over- and underperformance is critical for identifying different types of players. Players like Haaland, who consistently overperform their xG or xA, demonstrate elite finishing or creativity, highlighting their unique ability to convert opportunities beyond statistical expectations. On the other hand, players underperforming these metrics may indicate inefficiency or bad luck, but they could also represent undervalued opportunities if their underlying statistics remain strong and consistent. This insight helps managers differentiate between sustainable excellence and potential rebounds in performance.
# Filter data for the 2024-2025 season
season_2024_2025 = master_cleaned[master_cleaned['Season'] == '2024-2025']
# Aggregate data to calculate total expected goals and actual goals scored
aggregated_data = season_2024_2025.groupby(['name']).agg({
'expected_goals': 'sum', # Aggregating total expected goals
'goals_scored': 'sum' # Aggregating total goals scored
}).reset_index()
# Filter top performers based on Aggregated Expected Goals or Actual Goals Scored
top_performers = aggregated_data[
(aggregated_data['expected_goals'] > 4) |
(aggregated_data['goals_scored'] > 4)
]
# Plotting the scatter plot without position hue
plt.figure(figsize=(12, 8))
sns.scatterplot(
x='expected_goals',
y='goals_scored',
data=top_performers,
edgecolor="w",
s=100,
color='blue' # Single color for all points
)
# Add a reference line for x = y
max_value = max(top_performers['expected_goals'].max(),
top_performers['goals_scored'].max())
plt.plot(
[0, max_value],
[0, max_value],
'k--', linewidth=1, label="x = y"
)
# Annotate each player by name
for _, row in top_performers.iterrows():
    plt.text(
        row['expected_goals'],
        row['goals_scored'],
        row['name'],
        fontsize=9,
        alpha=0.9,
        rotation=45
    )
# Customize plot
plt.title("Aggregated Expected Goals vs Actual Goals Scored by Player (2024-2025)", fontsize=14, fontweight='bold')
plt.xlabel("Aggregated Expected Goals", fontsize=12)
plt.ylabel("Aggregated Goals Scored", fontsize=12)
plt.grid(alpha=0.3)
plt.xlim(0, top_performers['expected_goals'].max() + 1)
plt.ylim(0, top_performers['goals_scored'].max() + 1)
plt.tight_layout()
plt.show()
The scatterplot compares Aggregated Expected Goals (xG) to Actual Goals Scored for top-performing players so far this season. The diagonal line represents perfect alignment between xG and goals scored (x = y). Players above the line, such as Erling Haaland, have outperformed their xG, suggesting exceptional finishing ability, a favorable streak, or positive variance. Conversely, players below the line, like Kai Havertz and Brennan Johnson, are generating strong underlying numbers but may have been on the wrong side of variance or unlucky with their finishing. Understanding this relationship is critical for assessing player sustainability. Overperformance may not always be repeatable, while underperforming players with strong xG figures could represent undervalued opportunities likely to deliver better returns over time. This analysis emphasizes the importance of xG in identifying both reliable performers and potential breakout candidates.
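The position of a player relative to the x = y line can also be expressed numerically. The following sketch labels players by the sign and size of goals minus xG. The season totals are hypothetical, and the `tolerance` threshold of one goal is an arbitrary assumption introduced here for illustration, not a value used in the analysis.

```python
# Hypothetical season totals: name -> (expected_goals, goals_scored)
season_totals = {
    "Player A": (10.0, 14),
    "Player B": (8.5, 5),
    "Player C": (6.0, 6),
}

def classify(expected_goals, goals_scored, tolerance=1.0):
    """Label a player by how far actual goals sit from expected goals.

    `tolerance` (in goals) is an assumed threshold for "close to the line".
    """
    diff = goals_scored - expected_goals
    if diff > tolerance:
        return "overperforming"
    if diff < -tolerance:
        return "underperforming"
    return "in line with xG"

labels = {name: classify(xg, goals) for name, (xg, goals) in season_totals.items()}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Players labelled as underperforming here correspond to points below the diagonal in the scatterplot.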
Extra Analysis: BPS and its relation to Player Position
total_season_bonus = (master_cleaned.groupby(['name', 'position', 'Season'])['bonus'].sum().reset_index())
N = 50
top_n_players_season = (total_season_bonus.groupby(['Season']).apply(lambda group: group.nlargest(N, 'bonus')).reset_index(drop=True))
# Average bonus points by position and season among the top N players per season (changing N changes the output)
bonus_points_by_position = (top_n_players_season.groupby(['Season', 'position'])['bonus'].mean().reset_index())
plt.figure(figsize=(12, 8))
sns.barplot(data=bonus_points_by_position,x='position',y='bonus',hue='Season')
plt.title(f"Average Bonus Points by Position and Season (Top {N} Players)", fontsize=16)
plt.xlabel("Position", fontsize=12)
plt.ylabel("Average Bonus Points", fontsize=12)
plt.legend(title="Season", loc='upper left')
plt.show()
- This chart shows the average season bonus points among the top 50 bonus-point earners in each season, broken down by FPL position.
- Midfielders (MID) and forwards (FWD) clearly lead in bonus points across seasons, reflecting their balanced contribution to goals, assists, and defensive actions, which makes them a central part of any fantasy team.
master_cleaned_copy.columns
Index(['xP', 'assists', 'bonus', 'bps', 'clean_sheets', 'creativity',
'element', 'fixture', 'goals_conceded', 'goals_scored',
'Influence_Creativity_Threat_Index', 'influence', 'kickoff_time',
'minutes', 'opponent_team', 'own_goals', 'penalties_missed',
'penalties_saved', 'red_cards', 'round', 'saves', 'selected',
'team_a_score', 'team_h_score', 'threat', 'total_points',
'transfers_balance', 'transfers_in', 'transfers_out', 'value',
'was_home', 'yellow_cards', 'GW', 'expected_goals', 'expected_assists',
'expected_goal_involvements', 'position_DEF', 'position_FWD',
'position_GK', 'position_MID', 'team_label', 'Hour', 'DayOfWeek',
'Weekend', 'WeekOfYear', 'Month', 'Year', 'Season', 'gpm', 'apm',
'cumulative_goals', 'cumulative_assists', 'cumulative_xG',
'cumulative_xA', 'cumulative_xGI', 'cumulative_gpm', 'cumulative_apm',
'cumulative_xP', 'cumulative_points', 'cumulative_minutes', 'season',
'position_bin'],
dtype='object')
Category 3: Miscellaneous graphs (important metrics which affect the total points of the players)
Home and Away Matches
Examining home versus away performance and key match metrics reveals the contextual factors influencing player output. This analysis provides insights into how match location and key game events contribute to total points.
# Home vs Away performance
sns.boxplot(x='was_home', y='total_points', data=master_cleaned, palette=['red', 'blue'])
plt.title("Home vs Away Performance (Total Points)")
plt.xlabel("Was Home")
plt.ylabel("Total Points")
plt.show()
Home advantage is evident from the higher average and median points at home. Such insights can influence fantasy team captain choices for different fixtures. For example, for a Liverpool home match, managers could favor Liverpool players over players from other Premier League teams.
The next plot looks at home and away matches from another angle: whether specific players perform better at home or away.
# Calculate Home Points and Away Points based on 'was_home'
home_points_2 = master_cleaned[master_cleaned['was_home'] == True].groupby(['name', 'position']).agg({
'total_points': 'mean'
}).rename(columns={'total_points': 'Home Points'})
away_points_2 = master_cleaned[master_cleaned['was_home'] == False].groupby(['name', 'position']).agg({
'total_points': 'mean'
}).rename(columns={'total_points': 'Away Points'})
# Merge Home and Away Points
player_comparison_filtered = home_points_2.merge(away_points_2, on=['name', 'position'], how='outer').reset_index()
# Calculate Total Points (sum of average Home and Away Points)
player_comparison_filtered['Total Points'] = (
player_comparison_filtered['Home Points'].fillna(0) +
player_comparison_filtered['Away Points'].fillna(0)
)
# Sort by Total Points and keep the top 30 players
top30_players = player_comparison_filtered.sort_values(by='Total Points', ascending=False).head(30)
# Plotting the scatter plot with hue based on player position
plt.figure(figsize=(8, 6))
sns.scatterplot(
x='Away Points',
y='Home Points',
hue='position',
data=top30_players,
edgecolor="w",
s=100,
palette="Set2"
)
# Add a reference line for x = y
max_value = top30_players[['Away Points', 'Home Points']].max().max()
plt.plot(
[0, max_value],
[0, max_value],
'k--', linewidth=1, label="x = y"
)
# Annotate players' names
for _, row in top30_players.iterrows():
    plt.text(
        row['Away Points'],
        row['Home Points'],
        row['name'],
        fontsize=8,
        alpha=0.8,
        rotation=45
    )
# Add plot enhancements
plt.title("Top 30 Player Performances: Home vs Away", fontsize=14, fontweight='bold')
plt.xlabel("Average Away Points", fontsize=12)
plt.ylabel("Average Home Points", fontsize=12)
plt.grid(alpha=0.3)
plt.xlim(0, top30_players['Away Points'].max() + 0.05)
plt.ylim(0, top30_players['Home Points'].max() + 0.05)
plt.legend(title="Position", fontsize=10)
plt.show()
The majority of the data points cluster close to the diagonal line x = y, indicating that for most players, their performance at home and away is relatively similar. However, there are noticeable variations where some players perform significantly better either at home (above the diagonal) or away (below the diagonal).
- Midfielders (Orange): Some exhibit standout performances at home, reflected in their higher values on the y-axis.
- Forwards (Green): Many forwards are near or slightly above the x = y line, suggesting their performance might be more consistent but with a slight home advantage.
Gareth Bale shows exceptionally strong home performance relative to his away stats. Players closer to the diagonal line (e.g. Harry Kane) demonstrate balanced performance across home and away matches. Managers could use this data to select players for specific fixtures. For instance, players like Hourihane may suit away fixtures, while players like Bale may suit home fixtures.
This metric can help managers decide which players are better suited for home vs away games.
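The distance of a player from the diagonal can be summarized as a single number: average home points minus average away points. Below is a minimal standard-library sketch of that calculation. The rows are hypothetical (name, was_home, total_points) records, and `home_edge` is a name introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical match rows: (name, was_home, total_points)
rows = [
    ("Player A", True, 8),
    ("Player A", False, 2),
    ("Player A", True, 6),
    ("Player B", True, 3),
    ("Player B", False, 5),
    ("Player B", False, 3),
]

# Collect each player's points separately for home (True) and away (False) matches
by_player = defaultdict(lambda: {True: [], False: []})
for name, was_home, points in rows:
    by_player[name][was_home].append(points)

# Positive value: better at home; negative value: better away
home_edge = {
    name: mean(split[True]) - mean(split[False])
    for name, split in by_player.items()
}

for name, edge in home_edge.items():
    print(f"{name}: home minus away = {edge:+}")
```

A value near zero corresponds to a point on the x = y line in the scatterplot.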
# Calculate total goals, assists, and clean sheets for home and away games
metrics_home_away = master_cleaned.groupby('was_home')[['goals_scored', 'assists', 'clean_sheets']].sum().reset_index()
# Bar plot for total metrics
metrics_home_away_melted = metrics_home_away.melt(id_vars='was_home', var_name='Metric', value_name='Count')
# Explicit labeling with custom legend
plt.figure(figsize=(12, 6))
sns.barplot(
x='Metric',
y='Count',
hue='was_home',
data=metrics_home_away_melted,
palette=['red', 'blue']
)
# Add plot enhancements
plt.title("Total Goals, Assists, and Clean Sheets (Home vs Away)", fontsize=14, fontweight='bold')
plt.ylabel("Count", fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
- Home matches outperform away matches in all three metrics: goals scored, assists, and clean sheets.
- The largest difference is observed in clean sheets, suggesting stronger defensive performances at home.
Penalty and Red Card Impact
The following plots examine penalties missed and red cards to assess their negative impact on player scores and overall performance.
# Analyze the impact of penalties missed
sns.boxplot(x='penalties_missed', y='total_points', data=master_cleaned, palette='Set2')
plt.title("Impact of Penalties Missed on Total Points")
plt.xlabel("Penalties Missed")
plt.ylabel("Total Points")
plt.show()
Players who missed a penalty (indicated by 1 on the x-axis) generally show a lower distribution of 'total_points' than players who did not miss a penalty (indicated by 0 on the x-axis). This could be because players who usually perform well are chosen to take these kicks.
# Red cards impact
sns.boxplot(x='red_cards', y='total_points', data=master_cleaned, palette='Set1')
plt.title("Impact of Red Cards on Total Points")
plt.xlabel("Red Cards")
plt.ylabel("Total Points")
plt.show()
Players receiving a red card (indicated by 1 on the x-axis) have a significantly lower median 'total_points' compared to those without red cards (indicated by 0). Unlike penalties missed, red cards appear to have a more consistent and severe impact on fantasy scores.
Total Points vs Minutes Played
master_cleaned = pd.read_csv("../master_cleaned.csv")
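# Define bins and labels
# Intervals are left-closed (right=False below), so the last edge is 91 to keep full 90-minute appearances in the '75-90' bin
bins = [5, 15, 30, 45, 60, 75, 91]
labels = ['5-15', '15-30', '30-45', '45-60', '60-75', '75-90']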
# Create a new column for binned minutes
master_cleaned['minute_bins'] = pd.cut(master_cleaned['minutes'], bins=bins, labels=labels, right=False)
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='minute_bins', y='total_points', data=master_cleaned, palette='Blues', cut=0)
# Customize the plot
plt.title('Total Points Distribution by Minute Bins', fontsize=16)
plt.xlabel('Minute Bins', fontsize=14)
plt.ylabel('Total Points', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Show the plot
plt.tight_layout()
plt.show()
Each violin represents the distribution of total_points for players who played within specific minute_bins. The wider sections of the violins indicate where the density of total_points is higher. The 75-90 bin shows a broader distribution compared to 5-15, meaning players playing 75-90 minutes tend to have a wider range of total points.
The plot reveals that players who play more minutes generally score higher total points, although their lowest scores can also fall below those seen in the lower minute bins. The violins widen and shift higher on the y-axis for bins like 60-75 and 75-90.
The vertical extent of each violin shows the range of observed scores, including outliers. For instance, in the 75-90 bin, there are instances of extremely low or high total points, reflecting variability in player performance even with significant playing time.
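Because the bins are left-closed, it matters where exactly the edges sit. The following is a pure-Python sketch of left-closed binning, which is how `pd.cut` assigns values when `right=False`. It deliberately uses a final edge of 90 to show that a value equal to the last edge receives no bin, so the final edge has to sit above 90 if full-match appearances are meant to be counted. The helper name `left_closed_bin` is hypothetical.

```python
from bisect import bisect_right

# Bin edges and labels; each bin is [lower, upper), so the upper edge is excluded
bins = [5, 15, 30, 45, 60, 75, 90]
labels = ['5-15', '15-30', '30-45', '45-60', '60-75', '75-90']

def left_closed_bin(minutes):
    """Return the label of the left-closed bin containing `minutes`, or None."""
    idx = bisect_right(bins, minutes) - 1
    if idx < 0 or idx >= len(labels):
        return None  # below the first edge, or at/above the last edge
    return labels[idx]

for m in (4, 5, 15, 89, 90):
    print(m, left_closed_bin(m))
```

Here 15 minutes falls in '15-30' rather than '5-15', and 90 minutes falls outside every bin.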
Influence Creativity Threat Index
# Influence-Creativity-Threat Index vs Total Points
sns.scatterplot(x='Influence_Creativity_Threat_Index', y='total_points', hue='position', data=master_cleaned, palette='bright', alpha=0.7)
plt.title("Influence-Creativity-Threat Index vs Total Points")
plt.xlabel("Influence-Creativity-Threat Index")
plt.ylabel("Total Points")
plt.legend(title="Position")
plt.show()
There appears to be a positive relationship between 'Influence_Creativity_Threat_Index' and 'total_points'. Forwards dominate the high-threat, high-points region due to their primary scoring roles. However, midfielders with balanced indices reach comparable totals, underlining their versatility.
# Key event metrics across gameweeks
gameweek_metrics = master_cleaned.groupby('GW')[['goals_scored', 'assists', 'clean_sheets']].sum()
gameweek_metrics.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title("Key Metrics Across Gameweeks")
plt.xlabel("Gameweek")
plt.ylabel("Count")
plt.legend(title="Metrics")
plt.show()
Clean sheets outnumber goals and assists, meaning that defenders and goalkeepers with a good clean sheet record are valuable. Note that clean sheets are recorded per player, so a single team clean sheet is counted once for every player credited with it. Goals and assists are fewer in number, which indicates that they are rarer events in a football match than clean sheets. However, this is offset by the fact that they carry more points when they occur. Fantasy football managers often use gameweek trends to plan their transfers and team strategies.
Category 4: Player Value
price_bins = [3.5, 4.9, 5.5, 6.0, 7.9, 15.5]  # price bin edges
price_labels = ["4.0-4.9", "5.0-5.5", "5.6-6.0", "6.1-7.9", "8.0+"]
# right=True makes each bin include its upper edge, so e.g. 4.9 lands in "4.0-4.9" as the labels imply
master_cleaned['price_range'] = pd.cut(master_cleaned['value'], bins=price_bins, labels=price_labels, right=True)
# Player prices change throughout the season, so we use the price from each player's first gameweek of the season
start_of_season_prices = (master_cleaned.sort_values(['Season', 'GW']).groupby(['name', 'position', 'Season'])['price_range'].first().reset_index())
# Total season points for each player
total_season_points = (master_cleaned.groupby(['name', 'position', 'Season'])['total_points'].sum().reset_index())
# Merge the start-of-season price range into the total_season_points DataFrame
total_season_points = total_season_points.merge(start_of_season_prices, on=['name', 'position', 'Season'], how='left')
N = 50  # number of top players to keep per season
top_n_players_season = (total_season_points.groupby(['Season']).apply(lambda group: group.nlargest(N, 'total_points')).reset_index(drop=True))
# Average total season points for each price bin and position across seasons
avg_points_by_price_position = (top_n_players_season.groupby(['price_range', 'position'])['total_points'].mean().reset_index())
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_points_by_price_position,x='price_range',y='total_points',hue='position')
plt.title(f"Average Total Season Points by Price Range and Position (Top {N} Players, All Seasons)", fontsize=16)
plt.xlabel("Price Range", fontsize=12)
plt.ylabel("Average Total Season Points", fontsize=12)
plt.legend(title="Position", loc='upper left')
plt.show()
The bar chart displays the average total season points for the top 50 players per season, grouped by price range and position and aggregated across all seasons. Midfielders and forwards are not represented in the £4.0-4.9 million price range because players in these positions are rarely priced this low. When they are, they usually are not regular starters, which is why their performance data is not included in this bin. Similarly, GKs and DEFs are never priced in the premium £8.0 million+ range, which is typically dominated by MIDs and FWDs. High-priced defenders typically outperform mid-priced midfielders and forwards (£5.6-6.0 million bracket). Premium midfielders and forwards score significantly more points, reflecting their premium cost and contribution.
This analysis can help managers with bargain hunting. For example, a manager with a budget of £5.6-6.0 million would be better served choosing a defender. However, a manager with over £6.0 million to spend would be better served by an attacking purchase, since forwards and midfielders outperform defenders in that range.
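Specific Player Analysis
Top players by total points
# master_cleaned has one row per player per gameweek, so the 80 largest rows can contain several rows
# for the same player; the bar plot then shows one bar per distinct name
top_players = master_cleaned.nlargest(80, 'cumulative_points')
sns.barplot(x='cumulative_points', y='name', data=top_players)
plt.title("Top Players by Total Points")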
plt.xlabel("Total Points")
plt.ylabel("Player Name")
plt.show()
This chart highlights the top performers in terms of total points, with Erling Haaland, Mohamed Salah, and Harry Kane leading the list. These players tend to deliver strong returns consistently across matches. The ranking informs team selection in fantasy leagues by identifying the players who contribute the most points.
Below we take a random player (Cole Palmer, position = MID) and analyze his performance for FPL insights.
cole_palmer_data = master_cleaned[(master_cleaned['name'] == "Cole Palmer") & (master_cleaned['Season'] == "2023-2024")].copy()
# Sort by Gameweek to ensure proper ordering
cole_palmer_data = cole_palmer_data.sort_values(by='GW')
# Plot cumulative points on the primary y-axis and key metrics on the secondary y-axis
fig, ax1 = plt.subplots(figsize=(10, 7))
# Primary y-axis for cumulative points
line1 = ax1.plot(cole_palmer_data['GW'], cole_palmer_data['cumulative_points'], label="Cumulative Points", color='blue', marker='o', linestyle='-', linewidth=2)
ax1.set_xlabel("Gameweek", fontsize=12)
ax1.set_ylabel("Cumulative Points", fontsize=12, color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
# Secondary y-axis for key metrics (BPS, Threat, Influence)
ax2 = ax1.twinx()
line2 = ax2.plot(cole_palmer_data['GW'], cole_palmer_data['bps'].cumsum(), label="BPS", color='green', marker='o', linestyle='--', linewidth=1.5)
line3 = ax2.plot(cole_palmer_data['GW'], cole_palmer_data['threat'].cumsum(), label="Threat", color='orange', marker='o', linestyle='--', linewidth=1.5)
line4 = ax2.plot(cole_palmer_data['GW'], cole_palmer_data['influence'].cumsum(), label="Influence", color='purple', marker='o', linestyle='--', linewidth=1.5)
ax2.set_ylabel("Metric Values", fontsize=12, color='black')
ax2.tick_params(axis='y', labelcolor='black')
# Combine legends
lines = line1 + line2 + line3 + line4
labels = [l.get_label() for l in lines]
ax1.legend(lines, labels, loc="upper left", fontsize=10)
# Add title and grid
plt.title("Cole Palmer 23/24: Cumulative Points vs Key Metrics by Gameweek", fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
- This visualization ties Palmer's weekly performance metrics to his cumulative contributions.
- The metrics of Threat, Influence, and Bonus Points System (BPS) are critical drivers of total points in FPL, as they directly reflect a player's attacking potential, overall impact on matches, and consistency in earning bonus points.
- Peaks in BPS and Threat (e.g., Gameweek 33) coincide with major increases in cumulative points.
- The cumulative assessment effectively captures performance over time, highlighting both consistency and standout moments across gameweeks, offering a comprehensive view of a player's contribution.
# Define a function to normalize the metrics across all players to a range of [0, 1]
def normalize(series):
    return (series - series.min()) / (series.max() - series.min()) if series.max() != series.min() else series / series.max()
# Select the players for comparison
players = ['Cole Palmer', 'Bukayo Saka']
metrics = ['total_points', 'bps', 'threat', 'influence', 'expected_goals', 'expected_assists', 'goals_scored', 'assists', 'value']
# Normalize the metrics across all players first (on a copy, so master_cleaned keeps its raw values for later cells)
normalized = master_cleaned.copy()
normalized[metrics] = normalized[metrics].apply(normalize)
# Filter the data for the selected players and calculate their mean metrics
player_data = normalized[normalized['name'].isin(players)].groupby('name')[metrics].mean()
# Create radar chart
categories = metrics
num_vars = len(categories)
# Compute angle for each metric
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]
# Plot data for each player
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
for player in player_data.index:
    values = player_data.loc[player].tolist()
    values += values[:1]  # Close the radar chart
    # ax.fill(angles, values, alpha=0.25, label=player)
    ax.plot(angles, values, linewidth=2, label=player)
# Add descriptors
ax.set_yticks([])
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=10)
plt.title("Player Performance Metrics: Cole Palmer vs Bukayo Saka", fontsize=12, fontweight='bold', pad=20)
plt.legend(bbox_to_anchor=(1.2, 1.1), fontsize=10)
plt.show()
Here we compare key performance metrics for two elite midfielders. The budget typically limits how many elite players a manager can have in a team, so when making tough decisions it is important to look at cross-cutting metrics to make a better judgment.
We use a radar plot to show how the two players compare on important metrics. Palmer outperforms Saka in nearly every metric, including our target variable total_points. Despite this, Saka is valued significantly higher.
This shows that the right metric analysis can help managers choose better players and avoid price and reputation bias.
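The radar chart relies on min-max normalization to put metrics with very different scales on a common [0, 1] axis. The sketch below shows that rescaling on a plain list with hypothetical values. Note one assumption that differs from the notebook's function: for a constant input this sketch returns zeros.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention for a constant input: every value maps to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical per-match points for three matches
points = [2, 4, 10]
print(min_max(points))
```

Because the scaling uses the minimum and maximum over all players, each player's plotted value shows where they sit within the whole dataset rather than relative to the other player on the chart.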
Correlation Outcomes
columns_of_interest = [
    'value', 'expected_goals', 'expected_assists', 'minutes',
    'clean_sheets', 'saves', 'bps', 'was_home',
    'creativity', 'influence', 'threat'
]
# Calculate correlations with `total_points`
correlation_dict = {col: master_cleaned[col].corr(master_cleaned['total_points']) for col in columns_of_interest}
# Convert to a pandas Series for sorting
correlation_series = pd.Series(correlation_dict).sort_values()
# Plot the ascending bar chart
plt.figure(figsize=(12, 8))
# Draw horizontal bars (default color, black edges), ordered by ascending correlation
bars = plt.barh(correlation_series.index, correlation_series.values, edgecolor='black')
plt.title('Correlation between Player Metrics and Total Points', fontsize=16, fontweight='bold')
plt.xlabel('Correlation Coefficient', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
This graph shows the correlation coefficients between various player metrics and total points in FPL. The Bonus Points System (BPS) score and influence have the highest positive correlations with total points. This is partly expected, since BPS is built from match events such as goals, assists, clean sheets, and saves that also earn points. Clean sheets, expected goals, threat, and minutes also show strong positive correlations, reflecting their contribution to overall player scores. Saves and was_home have weaker correlations, suggesting their impact on total points is more situational or specific to certain player types, such as goalkeepers.
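Since a pooled correlation can hide position-specific effects such as saves for goalkeepers, it may help to compute the same coefficient within each position. The following is a minimal sketch, not the notebook's own method: `correlation_by_position` is a hypothetical helper, the toy rows are invented, and it assumes the `position`, `saves`, and `total_points` columns.

```python
import pandas as pd

def correlation_by_position(df, feature, target='total_points'):
    """Pearson correlation of `feature` with `target`, computed per position."""
    result = {}
    for position, group in df.groupby('position'):
        result[position] = group[feature].corr(group[target])
    return pd.Series(result)

# Toy rows for illustration only; these are NOT real FPL figures.
toy = pd.DataFrame({
    'position': ['Goalkeeper'] * 3 + ['Forward'] * 3,
    'saves': [2, 4, 6, 0, 0, 0],
    'total_points': [3, 5, 7, 2, 9, 6],
})

by_position = correlation_by_position(toy, 'saves')
print(by_position)
```

In the toy data, forwards never record a save, so the coefficient for that group is undefined (NaN) rather than weak. In a pooled correlation, rows like these dilute the relationship that exists for goalkeepers.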
Key Takeaways
- The key takeaway from all the graphs is that player performance in FPL is multifaceted.
- A combination of metrics such as BPS (Bonus Points System), Influence, Threat, Expected Goals (xG), Goals Scored, and Clean Sheets provides a more comprehensive understanding than any single metric.
- Position analysis: Midfielders and Forwards are consistently among the highest-scoring positions, and metrics related to goals and assists seem to best describe and predict their returns. For Defenders and Goalkeepers, performance seems to correlate better with clean sheets and, for Goalkeepers, saves.
- Home advantage is evident in the gameweek trend, with higher total points scored during home games. This can be used to a manager's advantage during home fixtures.
- Advanced metrics like xG and xA capture underlying performance: they measure whether a player is consistently getting or creating quality chances, regardless of what outcome-based metrics like goals scored or assists show.
- Value for money: A player's value can be misleading on its own. Combined with other metrics, it helps managers make better judgments and optimize their squad within the budget constraints.
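The home-advantage takeaway can be checked with a simple grouped mean. This is a sketch under stated assumptions: `home_away_summary` is a hypothetical helper, the toy rows are invented, and excluding appearances with zero minutes is a choice made here for illustration, not something the analysis above specifies.

```python
import pandas as pd

def home_away_summary(df):
    """Mean total_points and row count for home vs away appearances."""
    # Assumption: ignore rows where the player did not take the pitch.
    played = df[df['minutes'] > 0]
    venue = played['was_home'].map({True: 'home', False: 'away'})
    return played.groupby(venue)['total_points'].agg(['mean', 'count'])

# Toy rows for illustration only; these are NOT real FPL figures.
toy = pd.DataFrame({
    'was_home': [True, True, True, False, False],
    'minutes': [90, 90, 0, 90, 90],
    'total_points': [6, 4, 0, 3, 1],
})

summary = home_away_summary(toy)
print(summary)
```

Reporting the count alongside the mean makes it easier to see whether a difference rests on enough appearances to be meaningful.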